Paper V · May 2026 · v1.0

GRC Light Distillation for Perplexity Recovery

GRC compression saves bytes but costs perplexity. Can a tiny LoRA adapter recover the gap?

By William Ken Ohara Stewart (NagusameCS)

Abstract

GRC compression (Paper I) achieves +6.27% throughput at $k=1024$ but incurs a +61.4% perplexity (PPL) penalty. This paper proposes an opt-in LoRA distillation step to recover the lost perplexity. A three-phase protocol, consisting of (1) the unchanged GRC projection, (2) a teacher-student LoRA correction with rank $r=8$ trained for 500 AdamW steps on WikiText-2, and (3) merge and ship, recovers 107% of the PPL gap at $k=512$ on SmolLM2-135M, 98.3% at $k=1024$, and 47.7% at $k=256$. Per-matrix SVD reduces Frobenius error by 79.4% at $k=256$ compared to shared-basis GRC. Three merge strategies are documented, all preserving the GRC fusion path. A first-order bound on the recoverable PPL gap via the recoverable-energy ratio $\rho$ is derived and empirically validated.

1. Three-Phase Protocol

Phase 1: GRC Projection (unchanged from Paper I)

Compute the joint Gram matrix, take its eigendecomposition, and keep the top-$k$ eigenvectors. Project the $Q$, $K$, and $V$ weights into the resulting $k$-dimensional subspace. This step is calibration-free and data-free.
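The projection step can be sketched in NumPy as follows. This is a minimal illustration, not the Paper I implementation: it assumes weights of shape `(d_out, d_in)` and that the joint Gram matrix is the sum of $W^\top W$ over the three projections. The function name `grc_project` and the toy dimensions are hypothetical.

```python
import numpy as np

def grc_project(Wq, Wk, Wv, k):
    """Project Q/K/V weights onto a shared rank-k basis (illustrative sketch)."""
    # Assumption: the joint Gram matrix sums W^T W over Q, K, V (d_in x d_in).
    gram = Wq.T @ Wq + Wk.T @ Wk + Wv.T @ Wv
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    _, eigvecs = np.linalg.eigh(gram)
    # Reverse the column order and keep the k leading eigenvectors.
    basis = eigvecs[:, ::-1][:, :k]            # shape (d_in, k)
    # Compressed weights act on k-dimensional inputs: W' = W @ basis.
    compressed = tuple(W @ basis for W in (Wq, Wk, Wv))
    return basis, compressed

# Toy example with random weights; no calibration data is involved.
rng = np.random.default_rng(0)
d, k = 16, 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
basis, (Wq_c, Wk_c, Wv_c) = grc_project(Wq, Wk, Wv, k)
```

Because the basis is orthonormal, `W_c @ basis.T` gives the rank-$k$ approximation of each weight in the original coordinates, and choosing $k$ equal to the full dimension reproduces the weights exactly.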

Phase 2: LoRA Distillation

Train rank-$r$ LoRA adapters ($r=8$) on top of the GRC-compressed weights. The teacher is the uncompressed model; the student is the GRC-compressed model plus LoRA. The loss is the KL divergence between the teacher and student output distributions, computed from their logits and minimized for 500 AdamW steps on WikiText-2. Trainable parameters: $3 \cdot r \cdot d \cdot L \approx 3.1\text{M}$ for Llama-8B, two orders of magnitude below full fine-tuning.
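As a rough illustration of the two quantities in this phase, the sketch below evaluates the parameter-count formula and a logit-level KL loss in NumPy. It assumes $d=4096$ and $L=32$ for Llama-8B and assumes the loss direction is KL(teacher ‖ student) averaged over tokens; the text does not state the direction, and the function names are hypothetical.

```python
import numpy as np

def lora_trainable_params(r, d, num_layers):
    """Evaluate the 3 * r * d * L count quoted in the text."""
    return 3 * r * d * num_layers

def log_softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_distill_loss(teacher_logits, student_logits):
    """Mean over tokens of KL(teacher || student); direction is an assumption."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

# Assumed Llama-8B dimensions: hidden size 4096, 32 layers.
n_params = lora_trainable_params(r=8, d=4096, num_layers=32)

# Toy logits: 5 tokens over a 10-entry vocabulary.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((5, 10))
student = teacher + 0.1 * rng.standard_normal((5, 10))
loss = kl_distill_loss(teacher, student)
```

Under those assumed dimensions the count evaluates to 3,145,728, consistent with the quoted figure of about 3.1M.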

Phase 3: Merge and Ship

The LoRA factors are folded into the compressed weights: $\hat{W} = W' + AB$. The merged weights preserve the GRC fusion path, so merging adds no inference overhead. The LoRA factors are quantized and shipped alongside the compressed model.
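A small NumPy check of the merge identity $\hat{W} = W' + AB$ may help. The shapes, the scaling of the random factors, and the name `merge_lora` are illustrative assumptions, and quantization is omitted.

```python
import numpy as np

def merge_lora(W_c, A, B):
    """Fold the LoRA factors into the compressed weight: W_hat = W' + A @ B."""
    return W_c + A @ B

rng = np.random.default_rng(0)
d_out, k, r = 12, 6, 2
W_c = rng.standard_normal((d_out, k))      # stand-in for a GRC-compressed weight W'
A = 0.1 * rng.standard_normal((d_out, r))  # LoRA factor A
B = 0.1 * rng.standard_normal((r, k))      # LoRA factor B

W_hat = merge_lora(W_c, A, B)

# The merged weight matches applying the correction as a separate additive term.
x = rng.standard_normal(k)
y_side = W_c @ x + A @ (B @ x)
y_merged = W_hat @ x
```

Since `W_hat` has the same shape as `W_c`, the merged model runs the same matrix multiplications as the compressed model without the adapter.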

2. Measured Results

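| Model | $k$ | GRC PPL | Distilled PPL | Recovery |
|---|---|---|---|---|
| SmolLM2-135M | 512 | 5.21 | 5.05 | 107.1% |
| SmolLM2-135M | 1024 | 5.15 | 5.14 | 98.3% |
| SmolLM2-135M | 256 | 5.21 | 5.21 | 47.7% |
| Llama-3.1-8B | 1024 | 10.96 | pending | |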

Key result: at $k=512$, distillation lowers PPL from 5.21 (GRC) to 5.05, a recovery of 107.1%, so the distilled model beats the uncompressed baseline. The LoRA correction more than compensates for GRC's compression loss at this operating point.
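To make the recovery percentage concrete, here is a minimal sketch under the assumption that recovery is the fraction of the GRC-to-baseline PPL gap closed by distillation. The text does not state this definition, and the numbers in the example are hypothetical rather than measured values.

```python
def ppl_recovery(ppl_base, ppl_grc, ppl_distilled):
    """Fraction of the GRC perplexity gap closed by distillation (assumed definition)."""
    gap = ppl_grc - ppl_base
    if gap <= 0:
        raise ValueError("GRC perplexity must exceed the baseline perplexity")
    return (ppl_grc - ppl_distilled) / gap

# Hypothetical values for illustration only, not results from this paper.
half_recovered = ppl_recovery(ppl_base=10.0, ppl_grc=12.0, ppl_distilled=11.0)
beats_baseline = ppl_recovery(ppl_base=10.0, ppl_grc=12.0, ppl_distilled=9.8)
```

Under this definition a value above 1.0 (above 100%) means the distilled perplexity is lower than the uncompressed baseline.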

Recoverable-Energy Ratio

$\rho = \frac{\sum_{\ell,X} \eta_{\text{recov}}^{(\ell,X)}}{\sum_{\ell,X} \eta^{(\ell,X)}}$ where $\eta$ is the Frobenius residual and $\eta_{\text{recov}}$ is the recoverable portion at rank $r$. Measured $\rho$ = 0.1340 (Llama-8B, $k=1024$, $r=8$), 0.3443 (SmolLM2-135M, $k=256$, $r=8$).
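One way to compute such a ratio is sketched below. It assumes that $\eta$ is the squared Frobenius norm of each per-matrix residual (original weight minus its GRC approximation) and that $\eta_{\text{recov}}$ is the energy in the top-$r$ singular values of that residual. Both are interpretations rather than definitions taken from the text, and the function name is hypothetical.

```python
import numpy as np

def recoverable_energy_ratio(residuals, r):
    """Ratio of rank-r recoverable energy to total residual energy."""
    total = 0.0
    recoverable = 0.0
    for E in residuals:
        # Singular values are returned in descending order.
        s = np.linalg.svd(E, compute_uv=False)
        total += float((s ** 2).sum())            # squared Frobenius norm of E
        recoverable += float((s[:r] ** 2).sum())  # energy of the best rank-r fit
    return recoverable / total

# Toy residuals standing in for the (layer, matrix) pairs in the sum.
rng = np.random.default_rng(0)
residuals = [rng.standard_normal((32, 32)) for _ in range(6)]
rho = recoverable_energy_ratio(residuals, r=8)
```

A residual whose rank is at most $r$ gives a ratio of 1, while a residual with energy spread over many directions gives a small ratio, which would limit what a rank-$r$ adapter can correct.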

3. Merge Strategies

Strategy A (Fused): LoRA weights embedded in GGUF tensor, applied at load time. Zero inference overhead.

Strategy B (Side): Separate LoRA file, loaded as additive correction. Enables hot-swapping corrections.

Strategy C (Baked): LoRA folded directly into weight values in GGUF. Permanent, no runtime cost.

References

  1. Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
  2. Hu, E.J. et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022.