Structure-Aware FFN Compression via Column Clustering, HyperTensor Paper VII

Abstract

GRC (Paper I) compresses only the Q, K, V attention projection matrices, leaving the FFN, which constitutes approximately 65% of transformer parameters, untouched. We propose per-cluster SVD on FFN columns: cluster the columns, then apply a truncated SVD within each cluster. The implemented variant clusters by $\ell_2$ magnitude; cosine-similarity and activation-guided variants are designed but not implemented. At aggressive compression ratios ($k = 0.25n$), 4-cluster compression improves local reconstruction by 21–25% over global SVD. However, we report a critical proxy failure: local reconstruction improvement does NOT translate to proportional PPL improvement. Activation-weighted SVD (using real forward-pass statistics) reaches a 22.7× lower PPL than the weight-norm proxy (54.19 vs. 1230). Column weight norms are uncorrelated with functional importance, a major negative result. LoRA FFN distillation can close 99.9% of the PPL gap from a damaged baseline, but the real question, whether LoRA can close the gap from the activation-weighted baseline, remains open.

1. Introduction

Paper I demonstrated that GRC compression of the attention projection weights (Q, K, V) yields a 6.27% throughput gain at $k=1024$ on Llama-3.1-8B. But the FFN layers remain uncompressed and account for ~65% of total parameters. Each transformer block contains two large FFN linear transformations (three in gated variants such as the SwiGLU FFN used by Llama). FFN matrices differ structurally from attention: they act as key-value memories (Geva et al., 2021) rather than routing mechanisms, suggesting that structure-aware compression might outperform global SVD.

We test three clustering strategies: L2-magnitude (implemented), cosine-similarity (designed), and activation-guided (designed). The key question: does respecting FFN structure improve compression quality at equivalent byte budgets?

2. Method: Column-Cluster Compression

2.1 Clustering Strategies

L2-Magnitude Clustering: Partition columns by their $\ell_2$ norm. Columns with similar magnitudes are grouped together before SVD.
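The text does not say how the norm-based partition is cut, so the following is only a minimal sketch under one assumption: columns are sorted by $\ell_2$ norm and split into equal-count groups. The function name `l2_magnitude_clusters` and the toy matrix sizes are hypothetical.

```python
import numpy as np

def l2_magnitude_clusters(W, num_clusters):
    """Split the column indices of W into groups of similar L2 norm.

    Assumption: equal-count groups cut from the norm-sorted column order.
    """
    norms = np.linalg.norm(W, axis=0)   # one L2 norm per column
    order = np.argsort(norms)           # column indices, smallest norm first
    return np.array_split(order, num_clusters)

# Toy matrix (hypothetical sizes): 12 columns whose scale grows left to right.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 12)) * np.linspace(0.1, 2.0, 12)
clusters = l2_magnitude_clusters(W, num_clusters=4)
for c, idx in enumerate(clusters):
    print(c, sorted(idx.tolist()))
```

Each returned array holds the column indices of one cluster, so `W[:, idx]` selects that cluster's sub-matrix.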

Cosine-Similarity Clustering: Group columns whose weight vectors point in similar directions, so that grouping depends on direction rather than magnitude.
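Since this strategy is designed but not implemented, the sketch below is one possible realisation rather than the paper's code: spherical k-means on unit-normalised columns, with a deterministic farthest-point initialisation. The function name, the iteration count, and the initialisation rule are all assumptions made for illustration.

```python
import numpy as np

def cosine_clusters(W, num_clusters, iters=10):
    """Label each column of W by direction, using spherical k-means."""
    # Normalise columns so that dot products are cosine similarities.
    X = W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-12)
    # Farthest-point initialisation (assumption): start from column 0, then
    # repeatedly add the column least similar to the centers chosen so far.
    chosen = [0]
    for _ in range(1, num_clusters):
        sim = (X[:, chosen].T @ X).max(axis=0)
        chosen.append(int(np.argmin(sim)))
    centers = X[:, chosen].copy()
    for _ in range(iters):
        labels = np.argmax(centers.T @ X, axis=0)   # nearest center by cosine
        for c in range(num_clusters):
            members = X[:, labels == c]
            if members.shape[1]:
                mean = members.sum(axis=1)
                centers[:, c] = mean / max(np.linalg.norm(mean), 1e-12)
    return np.argmax(centers.T @ X, axis=0)

# Two directions, with very different magnitudes inside each group.
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
W = np.stack([1.0 * a, 5.0 * a, 0.2 * a, 2.0 * b, 0.5 * b, 3.0 * b], axis=1)
labels = cosine_clusters(W, num_clusters=2)
print(labels)
```

In the toy input the columns along `a` differ in norm by a factor of 25, yet they share a label, which is the behaviour that separates this strategy from L2-magnitude clustering.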

Activation-Guided Clustering: Collect real activation statistics from forward passes; cluster columns that fire together.
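This strategy is also only designed, so the following is a hedged sketch of what "fire together" could mean in practice. It assumes each column corresponds to one hidden unit, that a unit fires when its activation is positive, and it swaps in a simple threshold-based leader clustering on the co-firing correlation. The function name, the threshold, and the firing rule are assumptions, and the number of clusters here follows from the threshold rather than being fixed in advance.

```python
import numpy as np

def coactivation_clusters(acts, threshold=0.5):
    """acts: (tokens, n) hidden activations. Returns one cluster label per column."""
    fired = (acts > 0).astype(float)              # binary firing pattern per unit
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(fired, rowvar=False)   # (n, n) co-firing correlation
    corr = np.nan_to_num(corr)                    # units that never vary give NaN
    n = acts.shape[1]
    labels = np.full(n, -1, dtype=int)
    leaders = []                                  # one representative column per cluster
    for j in range(n):
        for c, lead in enumerate(leaders):
            if corr[j, lead] >= threshold:        # fires with this cluster's leader
                labels[j] = c
                break
        else:                                     # no leader matched: open a new cluster
            leaders.append(j)
            labels[j] = len(leaders) - 1
    return labels

# Toy activations: units 0 and 1 fire on the first two tokens, units 2 and 3 on the rest.
acts = np.array([[ 1.0,  2.0, -1.0, -1.0],
                 [ 1.0,  3.0, -1.0, -2.0],
                 [-1.0, -1.0,  1.0,  1.0],
                 [-1.0, -2.0,  2.0,  1.0]])
labels = coactivation_clusters(acts)
print(labels)
```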

2.2 Per-Cluster SVD

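For each cluster $c$ with columns $W_c \in \mathbb{R}^{d \times n_c}$, compute the truncated SVD $W_c \approx U_c \Sigma_c V_c^T$ with rank $k_c = \lceil k \cdot n_c/n \rceil$, where $n_c$ is the number of columns in cluster $c$, $n$ is the total number of columns, and $k$ is the global rank budget. Rank is thus allocated in proportion to cluster size; because of the ceiling, $\sum_c k_c$ can exceed $k$ by up to one less than the number of clusters.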
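The per-cluster procedure can be sketched in NumPy as follows. The rank rule $k_c = \lceil k \cdot n_c/n \rceil$ is taken from the text; the helper names, the extra cap of $k_c$ at the sub-matrix dimensions, and the toy sizes are assumptions for illustration.

```python
import math
import numpy as np

def truncated_svd(M, rank):
    """Best rank-`rank` approximation of M in Frobenius norm."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

def per_cluster_svd(W, clusters, k):
    """Approximate W by an independent truncated SVD on each column cluster."""
    d, n = W.shape
    W_hat = np.zeros_like(W)
    for idx in clusters:
        # Rank proportional to cluster size, capped at the sub-matrix dimensions.
        k_c = min(math.ceil(k * len(idx) / n), len(idx), d)
        W_hat[:, idx] = truncated_svd(W[:, idx], k_c)
    return W_hat

# Toy example (hypothetical sizes): 4 equal clusters taken from norm-sorted columns.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 12))
clusters = np.array_split(np.argsort(np.linalg.norm(W, axis=0)), 4)
for k in (3, 6, 12):
    W_hat = per_cluster_svd(W, clusters, k)
    print(k, np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

With $k = n$ every cluster keeps full rank and the reconstruction is exact, which is a convenient sanity check before running the sketch at smaller budgets.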

3. Measured Results

3.1 Local Reconstruction (Phase 1)

Local reconstruction improvement relative to global SVD (higher is better):

Method	k=0.25n	k=0.50n	k=0.75n
Global SVD (baseline)	0.0%	0.0%	0.0%
2-cluster L2	+17.3%	+12.1%	+8.7%
4-cluster L2	+22.6%	+15.8%	+11.3%
8-cluster L2	+20.9%	+14.2%	+9.8%

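The 4-cluster variant achieves the peak improvement at every ratio tested. The 8-cluster variant already gives back part of the gain (20.9% versus 22.6% at $k = 0.25n$), and returns diminish further beyond 8 clusters due to fragmentation.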

3.2 The Proxy Failure (End-to-End PPL)

Critical finding: The 22.6% local reconstruction improvement does NOT carry over to a proportional PPL improvement. The reconstruction-to-PPL proxy fails catastrophically for FFN matrices, so local $\ell_2$ error is not a useful metric for FFN compression quality.

3.3 Weight-Norm vs. Activation-Weighted SVD

Method	PPL	vs. Baseline (27.2)
Activation-weighted SVD	54.19	1.99× baseline
Weight-norm SVD	1230	45× baseline
Uncompressed baseline	27.2	1.00×
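The ratios in the table follow directly from the three reported PPL values. The short check below reproduces them; the dictionary keys are illustrative names, not identifiers from the paper.

```python
# Reported perplexities from the table above.
ppl = {"baseline": 27.2, "activation_weighted": 54.19, "weight_norm": 1230.0}

ratio_act = ppl["activation_weighted"] / ppl["baseline"]         # vs. uncompressed
ratio_wn = ppl["weight_norm"] / ppl["baseline"]                  # vs. uncompressed
ratio_between = ppl["weight_norm"] / ppl["activation_weighted"]  # proxy vs. activations

print(f"activation-weighted: {ratio_act:.2f}x baseline")
print(f"weight-norm:         {ratio_wn:.1f}x baseline")
print(f"weight-norm / activation-weighted: {ratio_between:.1f}x")
```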

Activation-weighted SVD reaches a 22.7× lower PPL than the weight-norm proxy (54.19 vs. 1230). Column weight norms are uncorrelated with activation importance, a major negative result that falsifies the initial hypothesis that weight magnitude indicates functional importance.
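The text does not give the exact weighting formula, so the sketch below shows one plausible reading and should not be taken as the paper's implementation. It assumes each column $j$ carries an importance score $s_j$ and minimises $\sum_j s_j^2 \lVert w_j - \hat{w}_j \rVert^2$, which is solved by a truncated SVD of the column-scaled matrix followed by unscaling. The scores are either weight norms or the RMS of activations; here the activations are synthetic, and all names are hypothetical.

```python
import numpy as np

def column_weighted_svd(W, col_weight, rank):
    """Rank-`rank` approximation of W that minimises column-weighted error."""
    s = np.maximum(col_weight, 1e-8)              # avoid division by zero
    U, S, Vt = np.linalg.svd(W * s, full_matrices=False)
    return ((U[:, :rank] * S[:rank]) @ Vt[:rank]) / s

rng = np.random.default_rng(0)
d, n, tokens, rank = 16, 24, 256, 6
W = rng.standard_normal((d, n))
# Synthetic activations (assumption): some units are far more active than others.
acts = rng.standard_normal((tokens, n)) * rng.uniform(0.05, 3.0, size=n)
act_rms = np.sqrt((acts ** 2).mean(axis=0))       # activation-based importance
weight_norm = np.linalg.norm(W, axis=0)           # weight-norm proxy

def weighted_error(W_hat):
    """Reconstruction error with each column weighted by its activation RMS."""
    return np.linalg.norm((W - W_hat) * act_rms)

W_plain = column_weighted_svd(W, np.ones(n), rank)   # ordinary truncated SVD
W_wn = column_weighted_svd(W, weight_norm, rank)
W_act = column_weighted_svd(W, act_rms, rank)
print(weighted_error(W_plain), weighted_error(W_wn), weighted_error(W_act))
```

By construction the activation-weighted variant attains the lowest activation-weighted error among rank-6 approximations. Whether that carries over to PPL is an empirical question that this toy example cannot answer.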

4. LoRA FFN Distillation

LoRA adapters ($r=8$) applied to FFN layers can close 99.9% of the PPL gap from the weight-norm-damaged baseline. However, this is not useful on its own, because that baseline starts at 45× the uncompressed PPL. The real question is whether LoRA can close the gap from the activation-weighted baseline (1.99×). It remains open and is the critical missing experiment.
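For readers unfamiliar with the adapter structure, the following is a minimal NumPy sketch of a LoRA-augmented linear layer with $r=8$: the compressed weight stays frozen and a low-rank update $BA$ is added, with $B$ initialised to zero so that the adapter starts as a no-op. The class name, the `alpha` scaling value, and the layer sizes are assumptions. The distillation training loop, which would fit $A$ and $B$ to match the uncompressed layer's outputs, is omitted.

```python
import numpy as np

class LoRALinear:
    """Frozen weight plus a trainable low-rank update: y = x W^T + scale * x A^T B^T."""

    def __init__(self, W_frozen, r=8, alpha=8.0, seed=0):
        d_out, d_in = W_frozen.shape
        rng = np.random.default_rng(seed)
        self.W = W_frozen                                # compressed FFN weight, not trained
        self.A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, r))                    # trainable, zero at initialisation
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size

# Toy sizes (hypothetical); W_compressed stands in for an SVD-compressed FFN matrix.
rng = np.random.default_rng(1)
d_in, d_out = 32, 64
W_compressed = rng.standard_normal((d_out, d_in))
layer = LoRALinear(W_compressed, r=8)
x = rng.standard_normal((4, d_in))
print(layer(x).shape, layer.num_trainable())
```

The adapter adds $r(d_{in} + d_{out})$ trainable parameters per layer, which is small relative to the frozen matrix when $r=8$.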

5. Interpretation

What works: Per-cluster SVD improves local reconstruction. Activation-weighted SVD is dramatically better than weight-norm.

What doesn't work: Reconstruction error does not predict PPL. Weight norms are useless as importance proxies.

What remains: LoRA on the activation-weighted baseline must be tested. The combined Attn+FFN byte savings (estimated 1.35–1.7×) need end-to-end validation.

References

Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
Geva, M. et al. Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP, 2021.