Abstract
GRC (Paper I) compresses only the Q, K, V attention projection matrices, leaving the FFN, which accounts for approximately 65% of transformer parameters, untouched. We propose per-cluster SVD on FFN columns: cluster columns by activation pattern, then apply a truncated SVD within each cluster. At aggressive compression ratios ($k = 0.25n$), 4-cluster compression recovers 21–25% of the reconstruction error incurred by global SVD. However, we report a critical proxy failure: the local reconstruction improvement does NOT translate into a proportional perplexity (PPL) improvement. Activation-weighted SVD (using real forward-pass statistics) reaches a 22.7× lower PPL than the weight-norm proxy (54.19 vs. 1230). Column weight norms are uncorrelated with functional importance, a major negative result. LoRA FFN distillation can close 99.9% of the PPL gap from the weight-norm-damaged baseline, but the real question, whether LoRA can close the gap from the activation-weighted baseline, remains open.
1. Introduction
Paper I demonstrated that GRC compression of the attention projection weights (Q, K, V) yields a 6.27% throughput gain at $k=1024$ on Llama-3.1-8B. But the FFN layers (two large linear transformations per transformer block) remain uncompressed and account for ~65% of total parameters. FFN matrices differ structurally from attention: they act as key-value memories (Geva et al., 2021) rather than routing mechanisms, suggesting that structure-aware compression might outperform global SVD.
We consider three clustering strategies: L2-magnitude (implemented), cosine-similarity (designed but not yet implemented), and activation-guided (designed but not yet implemented). The key question is whether respecting FFN structure improves compression quality at an equivalent byte budget.
2. Method: Column-Cluster Compression
2.1 Clustering Strategies
L2-Magnitude Clustering: Partition columns by their $\ell_2$ norm. Columns with similar magnitudes are grouped together before SVD.
Cosine-Similarity Clustering: Group columns whose weight vectors point in similar directions (directional, not magnitude).
Activation-Guided Clustering: Collect real activation statistics from forward passes; cluster columns that fire together.
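The three strategies above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used for the results: the quantile binning for L2-magnitude clustering, the spherical k-means loop for cosine clustering, and the function names and parameters (`n_iter`, `seed`) are all assumptions, since the text does not specify them. The activation-guided variant assumes a calibration matrix `A` of shape `(n_tokens, n)` whose columns index the same units as the columns of $W$.

```python
import numpy as np

def l2_magnitude_clusters(W, n_clusters):
    """Label each column of W (d x n) by the quantile bin of its L2 norm.

    Assumption: equal-population bins; the binning rule is not given in the text.
    """
    norms = np.linalg.norm(W, axis=0)           # one norm per column
    order = np.argsort(norms, kind="stable")    # column indices, smallest norm first
    labels = np.empty(W.shape[1], dtype=int)
    for c, idx in enumerate(np.array_split(order, n_clusters)):
        labels[idx] = c                         # bin 0 holds the smallest norms
    return labels

def cosine_clusters(W, n_clusters, n_iter=20, seed=0):
    """Spherical k-means on unit-normalised columns (direction only)."""
    rng = np.random.default_rng(seed)
    X = W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-12)
    centers = X[:, rng.choice(X.shape[1], size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(centers.T @ X, axis=0)   # most similar center per column
        for c in range(n_clusters):
            members = X[:, labels == c]
            if members.shape[1] > 0:                # keep the old center if empty
                m = members.sum(axis=1)
                centers[:, c] = m / max(np.linalg.norm(m), 1e-12)
    return labels

def activation_guided_clusters(A, n_clusters, **kwargs):
    """Group units whose activation profiles over tokens point the same way.

    Assumption: "fire together" is measured by cosine similarity between
    the columns of the activation matrix A (n_tokens x n).
    """
    return cosine_clusters(A, n_clusters, **kwargs)
```

Each function returns an integer label per column, so any of the three can feed the same downstream per-cluster factorization.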
2.2 Per-Cluster SVD
For each cluster $c$ with columns $W_c \in \mathbb{R}^{d \times n_c}$, compute the truncated SVD: $W_c \approx U_c \Sigma_c V_c^T$ with rank $k_c = \lceil k \cdot n_c/n \rceil$. The per-cluster approach allocates rank proportionally to cluster size.
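A minimal NumPy sketch of this step is given below, assuming a cluster label per column is already available. The function names, the cap of $k_c$ at the full rank of $W_c$, and the parameter-count bookkeeping are illustrative assumptions rather than details taken from the experiments.

```python
import numpy as np

def truncated_svd(W, k):
    """Leading k singular triplets of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def per_cluster_svd(W, labels, k):
    """Approximate W (d x n) cluster by cluster.

    Returns the reconstruction and the number of stored factor entries.
    """
    d, n = W.shape
    W_hat = np.zeros_like(W, dtype=float)
    n_params = 0
    for c in np.unique(labels):
        cols = np.flatnonzero(labels == c)
        n_c = cols.size
        k_c = (k * n_c + n - 1) // n        # integer form of ceil(k * n_c / n)
        k_c = min(k_c, d, n_c)              # assumption: cap at the full rank of W_c
        U, s, Vt = truncated_svd(W[:, cols], k_c)
        W_hat[:, cols] = (U * s) @ Vt       # write columns back in original order
        n_params += k_c * (d + n_c)         # (U * s) is d x k_c, Vt is k_c x n_c
    return W_hat, n_params
```

Ignoring the ceiling, proportional allocation stores $\sum_c k_c (d + n_c) = kd + (k/n)\sum_c n_c^2 \le k(d + n)$ factor entries, so the per-cluster factors fit within the global SVD budget at the same $k$; with $C$ clusters the ceiling can raise the total rank $\sum_c k_c$ by at most $C - 1$.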
3. Measured Results
3.1 Local Reconstruction (Phase 1)
| Method (reconstruction improvement over global SVD) | k=0.25n | k=0.50n | k=0.75n |
|---|---|---|---|
| Global SVD (baseline) | 0.0% | 0.0% | 0.0% |
| 2-cluster L2 | +17.3% | +12.1% | +8.7% |
| 4-cluster L2 | +22.6% | +15.8% | +11.3% |
| 8-cluster L2 | +20.9% | +14.2% | +9.8% |
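The 4-cluster configuration achieves the peak improvement at every rank budget. Going to 8 clusters already reduces the gain (e.g., +20.9% vs. +22.6% at $k = 0.25n$), which we attribute to diminishing returns from fragmentation.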
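For reference, one plausible way to compute the percentages in the table is as a relative reduction in Frobenius reconstruction error against global SVD. The sketch below assumes that definition (the text does not state the exact formula), and the function names are hypothetical.

```python
import numpy as np

def frobenius_error(W, W_hat):
    """Local reconstruction error ||W - W_hat||_F."""
    return float(np.linalg.norm(W - W_hat))

def global_svd_error(W, k):
    """Error of the best rank-k approximation (Eckart-Young):
    the square root of the discarded squared singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.sqrt(np.sum(s[k:] ** 2)))

def relative_improvement(err_global, err_method):
    """Percentage reduction in error relative to global SVD.

    Assumed definition: positive means the method reconstructs better.
    """
    return 100.0 * (err_global - err_method) / err_global
```

Under this definition, a method whose error is 0.774 times the global SVD error would be reported as roughly +22.6%; these figures are purely illustrative and are not measured errors.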
3.2 The Proxy Failure (End-to-End PPL)
Critical finding: The 22.6% local reconstruction improvement does NOT produce a 22.6% PPL improvement. The reconstruction-to-PPL proxy fails catastrophically for FFN matrices. Local $\ell_2$ error is not a useful metric for FFN compression quality.
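One way to see how local reconstruction error and output error can come apart is the following standard identity, offered as an illustration rather than as a measured result. The symbols are assumptions introduced for this sketch: $x \in \mathbb{R}^n$ is the input that multiplies $W$, $\hat{W}$ is the compressed matrix, $\Delta W = W - \hat{W}$, and $\Sigma_x = \mathbb{E}[x x^T]$ is the second-moment matrix of the inputs.

$$
\mathbb{E}\,\lVert \Delta W\, x \rVert_2^2
= \mathbb{E}\,\mathrm{tr}\!\left(\Delta W\, x x^T \Delta W^T\right)
= \mathrm{tr}\!\left(\Delta W\, \Sigma_x\, \Delta W^T\right)
$$

The right-hand side equals $\lVert \Delta W \rVert_F^2$ only when $\Sigma_x = I$. When the inputs are far from isotropic, a method can lower $\lVert \Delta W \rVert_F$ while leaving the expected output error unchanged or worse. Whether this fully accounts for the proxy failure observed here is not established by our experiments.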
3.3 Weight-Norm vs. Activation-Weighted SVD
| Method | PPL | vs. Baseline (27.2) |
|---|---|---|
| Activation-weighted SVD | 54.19 | 1.99× baseline |
| Weight-norm SVD | 1230 | 45× baseline |
| Uncompressed baseline | 27.2 | 1.00× |
Activation-weighted SVD reaches a 22.7× lower PPL than the weight-norm proxy (54.19 vs. 1230). Column weight norms are uncorrelated with activation importance, a major negative result that falsifies the initial hypothesis.
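The comparison can be illustrated with a generic column-weighted truncated SVD, where only the source of the per-column weights changes. This is a sketch under stated assumptions, not the exact procedure behind the table: it assumes the weights enter as a diagonal scaling of the columns, that the activation statistic is a per-feature RMS over a calibration batch `X` of shape `(n_tokens, n)`, and that the weight-norm proxy is the column $\ell_2$ norm. All function names are hypothetical.

```python
import numpy as np

def weighted_truncated_svd(W, col_weights, k, eps=1e-8):
    """Rank-k W_hat minimising ||(W - W_hat) diag(col_weights)||_F.

    Scale the columns, truncate the SVD, then undo the scaling.
    """
    s = np.maximum(np.asarray(col_weights, dtype=float), eps)  # avoid divide-by-zero
    U, sig, Vt = np.linalg.svd(W * s, full_matrices=False)     # SVD of W diag(s)
    return ((U[:, :k] * sig[:k]) @ Vt[:k]) / s                 # multiply by diag(s)^-1

def activation_rms(X):
    """Per-feature RMS of calibration inputs X (n_tokens x n)."""
    return np.sqrt(np.mean(X ** 2, axis=0))

def weight_norm_scores(W):
    """Weight-only proxy: L2 norm of each column of W."""
    return np.linalg.norm(W, axis=0)

# Usage: same routine, two different importance estimates.
# W_act  = weighted_truncated_svd(W, activation_rms(X), k)
# W_norm = weighted_truncated_svd(W, weight_norm_scores(W), k)
```

By the Eckart-Young theorem, the weighted solution is optimal for the weighted objective among all rank-$k$ matrices, so the quality of the result depends entirely on whether the chosen weights reflect which columns matter at inference time.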
4. LoRA FFN Distillation
LoRA adapters ($r=8$) applied to the FFN layers can close 99.9% of the PPL gap from the weight-norm-damaged baseline. However, this result has limited practical value, because that starting point sits at 45× the uncompressed PPL. The real question is whether LoRA can close the gap from the activation-weighted baseline (1.99×), and it remains open. This is the critical missing experiment.
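To make the setup and the gap-closure figure concrete, the sketch below shows a standard LoRA forward pass on top of a compressed weight, together with one possible definition of "gap closed". Several details are assumptions: the scaling hyperparameter `alpha`, the initialization (Gaussian `A`, zero `B`, as in the original LoRA formulation), the class and function names, and the gap definition itself, which the text does not spell out.

```python
import numpy as np

class LoRALinear:
    """Frozen compressed weight W_hat plus a trainable rank-r update B @ A."""

    def __init__(self, W_hat, r=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W_hat.shape
        self.W_hat = W_hat                                # frozen, shape (d_out, d_in)
        self.A = rng.standard_normal((r, d_in)) * 0.01    # trainable, Gaussian init
        self.B = np.zeros((d_out, r))                     # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # x: (batch, d_in). The adapter adds scale * B A x to the frozen output.
        return x @ self.W_hat.T + self.scale * (x @ self.A.T) @ self.B.T

def gap_closed(ppl_start, ppl_after, ppl_ref):
    """Fraction of the PPL gap to the reference that was recovered (assumed definition)."""
    return (ppl_start - ppl_after) / (ppl_start - ppl_ref)

def implied_ppl(ppl_start, ppl_ref, fraction):
    """PPL implied by closing a given fraction of the gap (inverse of gap_closed)."""
    return ppl_start - fraction * (ppl_start - ppl_ref)
```

Because `B` starts at zero, the adapted layer initially reproduces the compressed layer exactly, and training only has to learn the correction. Under the assumed gap definition, closing 99.9% of the gap from 1230 toward 27.2 would correspond to a PPL of about 28.4; this is arithmetic on the reported figures, not a separately reported measurement.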
5. Interpretation
What works: Per-cluster SVD improves local reconstruction. Activation-weighted SVD is dramatically better than weight-norm.
What doesn't work: Reconstruction error does not predict PPL. Weight norms are useless as importance proxies.
What remains: LoRA on activation-weighted baseline must be tested. Combined Attn+FFN byte savings (estimated 1.35–1.7×) need end-to-end validation.
References
- Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
- Geva, M. et al. Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP, 2021.