Abstract
GRC (Paper I) compresses only three of the seven weight matrices, while the FFN accounts for ~65% of bytes. We propose per-cluster SVD on FFN columns: cluster the columns by activation pattern, then apply SVD within each cluster. At k=0.25n, 4-cluster compression improves reconstruction error by 22.6% over global SVD. However, reconstruction error does not predict perplexity (PPL), which is a critical proxy failure. Activation-weighted SVD reaches a 22.7× lower PPL than weight-norm SVD (54.19 vs. 1230), which indicates that column weight norms are uncorrelated with functional importance. LoRA FFN distillation closes 99.9% of the gap when started from the damaged baseline, but the real question, whether LoRA can also close the gap on the activation-weighted baseline, remains open.
1. Key Results
| Metric | Value |
|---|---|
| 4-cluster error improvement | +22.6% at k=0.25n |
| Activation-weighted SVD (PPL) | 54.19 (1.99× baseline) |
| Weight-norm SVD (PPL) | 1230 (45× baseline) |
| LoRA gap closure | 99.9% (from damaged baseline) |
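The 4-cluster row compares per-cluster SVD against a single global SVD. The following is a minimal sketch of that idea, not the implementation used for the reported numbers. It assumes plain k-means as the clustering method, a fixed rank per cluster, and random stand-in data; the function names (`kmeans_labels`, `per_cluster_svd`, `rel_error`) are hypothetical.

```python
import numpy as np

def truncated_svd(W, rank):
    """Best approximation of W with at most `rank` singular directions."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = min(rank, s.size)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def kmeans_labels(X, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means on the rows of X; returns one label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(X.shape[0], n_clusters, replace=False)].copy()
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def per_cluster_svd(W, acts, n_clusters, rank):
    """W: (d, n) FFN matrix with one column per hidden unit.
    acts: (samples, n) calibration activations of those units.
    Assumption: every cluster receives the same rank."""
    labels = kmeans_labels(acts.T, n_clusters)  # cluster units by activation pattern
    W_hat = np.zeros_like(W)
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        if idx.size:
            W_hat[:, idx] = truncated_svd(W[:, idx], rank)
    return W_hat, labels

def rel_error(W, W_hat):
    """Relative Frobenius reconstruction error."""
    return np.linalg.norm(W - W_hat) / np.linalg.norm(W)

# Stand-in data only; real use would take W and acts from the model.
rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64))
acts = rng.standard_normal((256, 64))
W_hat, labels = per_cluster_svd(W, acts, n_clusters=4, rank=8)
print("per-cluster:", rel_error(W, W_hat), "global:", rel_error(W, truncated_svd(W, 8)))
```

How the rank budget is split across clusters is left open here. If the cluster ranks only sum to k, a global rank-k SVD cannot have higher Frobenius error (Eckart-Young), so any improvement depends on how the two methods are matched, for example by parameter count.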
2. Critical Finding
Reconstruction error does not predict PPL for FFN matrices. Column weight norms are uncorrelated with activation importance, so the weight-norm proxy is falsified. The key missing experiment is to apply LoRA to the activation-weighted baseline (PPL at 1.99× baseline) and test whether the gap can be closed to <1.30×.
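The gap between the two proxies can be illustrated on synthetic data. The sketch below assumes a column-weighted low-rank fit: minimize the Frobenius norm of (W − Ŵ)·diag(s), with s set either to per-unit RMS activations or to column weight norms. The exact weighting scheme behind the reported numbers is not specified here, so `weighted_low_rank` and `output_error` are hypothetical helpers, and the data is random rather than taken from a model.

```python
import numpy as np

def weighted_low_rank(W, col_weights, rank, eps=1e-8):
    """Rank-limited fit minimizing ||(W - W_hat) @ diag(col_weights)||_F."""
    s = np.maximum(col_weights, eps)                        # avoid division by zero
    U, sv, Vt = np.linalg.svd(W * s, full_matrices=False)   # W * s scales the columns
    r = min(rank, sv.size)
    return ((U[:, :r] * sv[:r]) @ Vt[:r]) / s               # undo the column scaling

def output_error(W, W_hat, acts):
    """Relative error of the layer output acts @ W.T on calibration data."""
    ref = acts @ W.T
    return np.linalg.norm(ref - acts @ W_hat.T) / np.linalg.norm(ref)

# Synthetic setup (assumption): unit activation scales vary widely and are
# independent of the column norms of W.
rng = np.random.default_rng(0)
d, n, samples, rank = 32, 64, 2048, 8
W = rng.standard_normal((d, n))
scale = rng.lognormal(mean=0.0, sigma=1.5, size=n)
acts = rng.standard_normal((samples, n)) * scale

act_w = np.sqrt((acts ** 2).mean(axis=0))   # activation-based column weights
norm_w = np.linalg.norm(W, axis=0)          # weight-norm column weights

W_act = weighted_low_rank(W, act_w, rank)
W_norm = weighted_low_rank(W, norm_w, rank)
print("activation-weighted:", output_error(W, W_act, acts))
print("weight-norm:        ", output_error(W, W_norm, acts))
```

In this toy setting the activation-weighted fit preserves the layer output better because it spends its rank on the units that actually fire strongly. It measures output error on one matrix only and says nothing about PPL.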
References
- Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
- Geva, M. et al. Transformer FFN Layers Are Key-Value Memories. EMNLP 2021.