Abstract
GRC compression (Paper I) achieves +6.27% throughput at $k=1024$ but incurs a +61.4% perplexity penalty. This paper proposes an opt-in LoRA distillation step to recover the lost perplexity. A three-phase protocol recovers 107% of the PPL gap at $k=512$ on SmolLM2-135M, 98.3% at $k=1024$, and 47.7% at $k=256$. The phases are: (1) GRC projection, unchanged; (2) teacher-student LoRA correction with rank $r=8$, trained for 500 AdamW steps on WikiText-2; (3) merge and ship. Per-matrix SVD reduces Frobenius error by 79.4% at $k=256$ compared to shared-basis GRC. Three merge strategies are documented, all preserving the GRC fusion path. A first-order bound on the recoverable PPL gap via the recoverable-energy ratio $\rho$ is derived and empirically validated.
1. Three-Phase Protocol
Phase 1: GRC Projection (unchanged from Paper I)
Compute the joint Gram matrix, take its eigendecomposition, and keep the top-$k$ eigenvectors. Project the $Q, K, V$ weights into the resulting $k$-dimensional subspace. This step is calibration-free and data-free.
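The projection step can be sketched in NumPy as follows. This is a minimal illustration, not the Paper I implementation: the Gram matrix is assumed here to be the sum $W_Q^\top W_Q + W_K^\top W_K + W_V^\top W_V$ over the shared input dimension, and the function name `grc_project` and the toy shapes are hypothetical.

```python
import numpy as np

def grc_project(Wq, Wk, Wv, k):
    """Project Q/K/V weights onto a shared top-k eigenbasis (illustrative)."""
    # ASSUMPTION: joint Gram matrix over the shared input dimension d.
    # Each weight has shape (d_out, d), so G has shape (d, d).
    G = Wq.T @ Wq + Wk.T @ Wk + Wv.T @ Wv
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(G)
    # Reverse the column order and keep the k leading eigenvectors.
    P = eigvecs[:, ::-1][:, :k]  # shape (d, k)
    # Compressed weights act on projected inputs x' = P.T @ x.
    return P, (Wq @ P, Wk @ P, Wv @ P)

rng = np.random.default_rng(0)
d, k = 64, 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
P, (Wq_k, Wk_k, Wv_k) = grc_project(Wq, Wk, Wv, k)
print(P.shape, Wq_k.shape)  # (64, 16) (64, 16)
```

No calibration data enters the computation; only the weights themselves are used, which matches the data-free property stated above.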
Phase 2: LoRA Distillation
Train rank-$r$ LoRA adapters ($r=8$) on top of the GRC-compressed weights. The teacher is the uncompressed model; the student is the GRC-compressed model plus LoRA. The loss is the KL divergence between the teacher and student output logits, minimized for 500 AdamW steps on WikiText-2. Trainable parameters: $3 \cdot r \cdot d \cdot L \approx 3.1\text{M}$ for Llama-8B, two orders of magnitude below full fine-tuning.
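The parameter count and the distillation loss can be checked with a short sketch. The values $d=4096$ and $L=32$ are assumed for Llama-8B, the KL direction (teacher relative to student) is an assumption since the text does not fix it, and both function names are hypothetical.

```python
import numpy as np

def lora_trainable_params(r, d, n_layers, n_matrices=3):
    # Mirrors the count quoted in the text: 3 * r * d * L.
    return n_matrices * r * d * n_layers

def log_softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_distill_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over token positions (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

# ASSUMED Llama-8B dimensions: hidden size 4096, 32 layers.
print(lora_trainable_params(r=8, d=4096, n_layers=32))  # 3145728

rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 100))   # 4 token positions, toy vocab of 100
student = teacher + 0.1 * rng.standard_normal((4, 100))
loss = kl_distill_loss(teacher, student)
```

In an actual training loop this loss would be backpropagated through the LoRA factors only, with the GRC-compressed weights frozen.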
Phase 3: Merge and Ship
The LoRA factors are folded into the compressed weights: $\hat{W} = W' + AB$. The merged weights preserve the GRC fusion path, so the correction adds no inference overhead. The LoRA factors are quantized and shipped alongside the compressed model.
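A small NumPy check illustrates why merging adds no inference cost: applying $W'$ and the low-rank correction separately gives the same output as applying the merged $\hat{W}$ once. The shapes and variable names below are hypothetical toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, k, r = 32, 16, 8                      # toy sizes, not the paper's
W_c = rng.standard_normal((d_out, k))        # stands in for the GRC-compressed W'
A = 0.01 * rng.standard_normal((d_out, r))   # LoRA factor A
B = 0.01 * rng.standard_normal((r, k))       # LoRA factor B

# Merge: W_hat = W' + A B, same shape as W'.
W_hat = W_c + A @ B

x = rng.standard_normal(k)
y_unmerged = W_c @ x + A @ (B @ x)   # two extra small matmuls at runtime
y_merged = W_hat @ x                 # single matmul, same result
print(np.allclose(y_unmerged, y_merged))  # True
```

Because $\hat{W}$ has the same shape as $W'$, any kernel that consumed $W'$ can consume $\hat{W}$ unchanged.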
2. Measured Results
| Model | $k$ | GRC PPL | Distilled PPL | Recovery |
|---|---|---|---|---|
| SmolLM2-135M | 512 | 5.21 | 5.05 | 107.1% |
| SmolLM2-135M | 1024 | 5.15 | 5.14 | 98.3% |
| SmolLM2-135M | 256 | 5.21 | 5.21 | 47.7% |
| Llama-3.1-8B | 1024 | 10.96 | — | pending |
Key result: at $k=512$, recovery exceeds 100%, meaning the distilled model beats the uncompressed baseline. Distillation lowers PPL from 5.21 (GRC) to 5.05, so the LoRA correction more than compensates for GRC's compression loss at this operating point.
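One natural reading of the Recovery column is the fraction of the GRC-induced PPL gap that distillation closes. The sketch below assumes that definition, which the text does not state explicitly; the function name and the input numbers are hypothetical and are not measurements from the table.

```python
def ppl_recovery(ppl_base, ppl_grc, ppl_distilled):
    """Fraction of the PPL gap closed by distillation (assumed definition)."""
    gap = ppl_grc - ppl_base
    if gap <= 0:
        raise ValueError("GRC PPL must exceed the baseline PPL")
    return (ppl_grc - ppl_distilled) / gap

# Hypothetical values: half of the gap is recovered.
print(ppl_recovery(10.0, 12.0, 11.0))        # 0.5
# A distilled PPL below the baseline gives recovery above 100%.
print(ppl_recovery(10.0, 12.0, 9.9) > 1.0)   # True
```

Under this definition, a recovery above 100% is exactly the case where the distilled PPL falls below the uncompressed baseline.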
Recoverable-Energy Ratio
$\rho = \frac{\sum_{\ell,X} \eta_{\text{recov}}^{(\ell,X)}}{\sum_{\ell,X} \eta^{(\ell,X)}}$, where the sums run over layers $\ell$ and projection matrices $X$, $\eta$ is the Frobenius residual, and $\eta_{\text{recov}}$ is the portion of that residual recoverable at rank $r$. Measured $\rho$ = 0.1340 (Llama-8B, $k=1024$, $r=8$) and 0.3443 (SmolLM2-135M, $k=256$, $r=8$).
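The ratio can be computed from weight residuals along the following lines. This sketch assumes that $\eta$ is the squared Frobenius norm of the residual $W - W'$ and that $\eta_{\text{recov}}$ is the energy in its top-$r$ singular values (the most a rank-$r$ correction can capture, by the Eckart-Young theorem). The function names are hypothetical, and $W'$ is taken in the same shape as $W$.

```python
import numpy as np

def recoverable_energy(W, W_c, r):
    """Return (eta_recov, eta) for one matrix under the stated assumptions."""
    R = W - W_c                                # residual left by compression
    s = np.linalg.svd(R, compute_uv=False)     # singular values, descending
    eta = float((s ** 2).sum())                # equals ||R||_F^2
    eta_recov = float((s[:r] ** 2).sum())      # best rank-r capture
    return eta_recov, eta

def rho(pairs, r):
    """Aggregate ratio over (W, W_c) pairs, one pair per (layer, matrix)."""
    recov, total = 0.0, 0.0
    for W, W_c in pairs:
        e_r, e = recoverable_energy(W, W_c, r)
        recov += e_r
        total += e
    return recov / total

rng = np.random.default_rng(0)
pairs = []
for _ in range(4):
    W = rng.standard_normal((32, 32))
    W_c = W + 0.1 * rng.standard_normal((32, 32))  # stand-in for a compressed weight
    pairs.append((W, W_c))
value = rho(pairs, r=8)
print(0.0 < value < 1.0)  # True
```

A value of $\rho$ near 1 means the residual is almost entirely low-rank and a rank-$r$ adapter can in principle absorb it; a small $\rho$ means most of the residual energy lies outside any rank-$r$ correction.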
3. Merge Strategies
Strategy A (Fused): LoRA weights embedded in GGUF tensor, applied at load time. Zero inference overhead.
Strategy B (Side): Separate LoRA file, loaded as additive correction. Enables hot-swapping corrections.
Strategy C (Baked): LoRA folded directly into weight values in GGUF. Permanent, no runtime cost.
References
- Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
- Hu, E.J. et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022.