Paper complete · 4-cluster SVD reduces reconstruction error by 22.6% relative to global SVD at k=0.25n. Activation-weighted SVD reaches 22.7× lower PPL than the weight-norm proxy. Critical finding: column weight norms do NOT correlate with activation importance. LoRA recovery tested at 99.9% gap closure. Proxy failure reported: reconstruction error does not predict PPL.
Paper VII · May 2026 · v1.0

Structure-Aware FFN Compression via Column Clustering

GRC compresses only 3/7 weight matrices. The FFN comprises ~65% of bytes. Can per-cluster SVD beat global SVD?

By William Ken Ohara Stewart (NagusameCS)

Abstract

GRC (Paper I) compresses only the Q, K, and V attention projection matrices, leaving the FFN, which holds approximately 65% of transformer parameters, untouched. We propose per-cluster SVD on FFN columns: cluster the columns (by $\ell_2$ magnitude in the implemented variant), then apply a truncated SVD within each cluster. At aggressive compression ratios ($k = 0.25n$), 4-cluster compression reduces reconstruction error by 21–25% relative to global SVD. However, we report a critical proxy failure: the local reconstruction improvement does NOT translate into a proportional PPL improvement. Activation-weighted SVD (using real forward-pass statistics) reaches 22.7× lower PPL than the weight-norm proxy (54.19 vs. 1230). Column weight norms are uncorrelated with functional importance, which is a major negative result. LoRA FFN distillation can close 99.9% of the PPL gap from a damaged baseline, but the real question, whether LoRA can close the gap from the activation-weighted baseline, remains open.

1. Introduction

Paper I demonstrated that GRC compression of the attention projection weights (Q, K, V) achieves +6.27% throughput at $k=1024$ on Llama-3.1-8B. The FFN layers, however, remain uncompressed: the three large linear projections in each Llama-3.1-8B block (gate, up, and down) account for ~65% of total parameters. FFN matrices differ structurally from attention matrices: they act as key-value memories (Geva et al., 2021) rather than routing mechanisms, which suggests that structure-aware compression might outperform global SVD.

We consider three clustering strategies: L2-magnitude (implemented and evaluated), cosine-similarity (designed, not yet evaluated), and activation-guided (designed, not yet evaluated). The key question: does respecting FFN structure improve compression quality at equivalent byte budgets?

2. Method: Column-Cluster Compression

2.1 Clustering Strategies

L2-Magnitude Clustering: Partition columns by their $\ell_2$ norm. Columns with similar magnitudes are grouped together before SVD.

Cosine-Similarity Clustering: Group columns whose weight vectors point in similar directions, so that grouping depends on direction rather than magnitude.

Activation-Guided Clustering: Collect real activation statistics from forward passes; cluster columns that fire together.
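The L2-magnitude strategy, the only one implemented, can be sketched as follows. This is a minimal illustration, not the paper's code: the text does not say how the norm partition is drawn, so the sketch assumes equal-population quantile bands, and the function name `l2_magnitude_clusters` and the toy matrix are hypothetical.

```python
import numpy as np

def l2_magnitude_clusters(W: np.ndarray, n_clusters: int) -> np.ndarray:
    """Assign each column of W (shape d x n) to one of n_clusters norm bands.

    Assumption: bands are equal-population quantiles of the column norms.
    Cluster 0 holds the smallest-norm columns, the last cluster the largest.
    """
    norms = np.linalg.norm(W, axis=0)           # one l2 norm per column
    order = np.argsort(norms, kind="stable")    # column indices, ascending norm
    labels = np.empty(W.shape[1], dtype=np.int64)
    # array_split keeps group sizes near-equal even if n % n_clusters != 0.
    for c, idx in enumerate(np.array_split(order, n_clusters)):
        labels[idx] = c
    return labels

rng = np.random.default_rng(0)
# Toy matrix whose columns have deliberately varied scales.
W = rng.standard_normal((16, 12)) * rng.uniform(0.1, 5.0, size=12)
labels = l2_magnitude_clusters(W, n_clusters=4)
print(labels)
print(np.bincount(labels))  # number of columns per cluster
```

A k-means partition over the norms would be an equally plausible reading of "partition columns by their $\ell_2$ norm"; quantile bands are used here only because they guarantee non-empty clusters.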

2.2 Per-Cluster SVD

For each cluster $c$ with columns $W_c \in \mathbb{R}^{d \times n_c}$, compute the truncated SVD: $W_c \approx U_c \Sigma_c V_c^T$ with rank $k_c = \lceil k \cdot n_c/n \rceil$. The per-cluster approach allocates rank proportionally to cluster size.
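The rank allocation above can be sketched in NumPy. This is an illustrative sketch under stated assumptions: the helper names (`truncated_svd`, `per_cluster_svd`), the random test matrix, and the quantile-based labels are all hypothetical, and the printed improvement on random data is not expected to match the measured results.

```python
import numpy as np

def truncated_svd(W: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of W in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    k = min(k, s.size)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def per_cluster_svd(W: np.ndarray, labels: np.ndarray, k: int):
    """Approximate each column cluster W_c with rank k_c = ceil(k * n_c / n)."""
    n = W.shape[1]
    W_hat = np.zeros_like(W)
    ranks = {}
    for c in np.unique(labels):
        cols = np.flatnonzero(labels == c)
        k_c = int(np.ceil(k * cols.size / n))   # rank proportional to cluster size
        W_hat[:, cols] = truncated_svd(W[:, cols], k_c)
        ranks[int(c)] = k_c
    return W_hat, ranks

rng = np.random.default_rng(0)
d, n, n_clusters = 64, 128, 4
W = rng.standard_normal((d, n)) * rng.uniform(0.1, 5.0, size=n)

# Hypothetical labels: equal-size bands of ascending column norm.
order = np.argsort(np.linalg.norm(W, axis=0))
labels = np.empty(n, dtype=np.int64)
for c, idx in enumerate(np.array_split(order, n_clusters)):
    labels[idx] = c

k = n // 4                                      # the k = 0.25n setting
W_global = truncated_svd(W, k)
W_cluster, ranks = per_cluster_svd(W, labels, k)

err_global = np.linalg.norm(W - W_global)
err_cluster = np.linalg.norm(W - W_cluster)
print("per-cluster ranks:", ranks)
print(f"relative change vs. global SVD: {(err_global - err_cluster) / err_global:+.1%}")
```

The last line reports the error change as a fraction of the global-SVD error, which is one natural reading of the percentages in the results table; the paper does not state the exact formula.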

3. Measured Results

3.1 Local Reconstruction (Phase 1)

| Method | $k=0.25n$ | $k=0.50n$ | $k=0.75n$ |
|---|---|---|---|
| Global SVD | 0.0% (baseline) | 0.0% | 0.0% |
| 2-cluster L2 | +17.3% | +12.1% | +8.7% |
| 4-cluster L2 | +22.6% | +15.8% | +11.3% |
| 8-cluster L2 | +20.9% | +14.2% | +9.8% |

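The 4-cluster configuration achieves the largest improvement at every rank budget. At 8 clusters the gain already declines (+20.9% vs. +22.6% at $k=0.25n$), consistent with diminishing returns from fragmentation.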

3.2 The Proxy Failure (End-to-End PPL)

Critical finding: The 22.6% local reconstruction improvement does NOT produce a 22.6% PPL improvement. The reconstruction-to-PPL proxy fails catastrophically for FFN matrices. Local $\ell_2$ error is not a useful metric for FFN compression quality.
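One general way to see how local $\ell_2$ error can decouple from output quality is the following identity. It is a standard observation offered as context, not an analysis taken from the paper, and it assumes the matrix acts as $y = Wx$ on input activations $x$ with second-moment matrix $\Sigma_x = \mathbb{E}[x x^\top]$. Writing $\Delta W = W - \hat{W}$ for the compression error:

$$\mathbb{E}\,\|\Delta W\, x\|_2^2 = \operatorname{tr}\!\left(\Delta W\, \Sigma_x\, \Delta W^\top\right), \qquad \text{which equals } \|\Delta W\|_F^2 \text{ only when } \Sigma_x = I.$$

When activations are far from isotropic, two approximations with the same Frobenius error can therefore produce very different output errors, and the downstream effect on PPL adds further nonlinearity on top of this.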

3.3 Weight-Norm vs. Activation-Weighted SVD

| Method | PPL | PPL relative to baseline (27.2) |
|---|---|---|
| Activation-weighted SVD | 54.19 | 1.99× |
| Weight-norm SVD | 1230 | 45× |
| Uncompressed baseline | 27.2 | 1.00× |

Activation-weighted SVD reaches 22.7× lower PPL than weight-norm SVD (54.19 vs. 1230). Column weight norms are uncorrelated with activation importance, a major negative result that falsifies the initial hypothesis.
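The paper does not spell out its weighting scheme, so the following is only a sketch of one common formulation of activation-weighted SVD: scale each input column by the RMS of its calibration activations, truncate, then undo the scaling. The function names, the synthetic calibration matrix `X`, and the `eps` guard are assumptions made for illustration.

```python
import numpy as np

def truncated_svd(W: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of W in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def activation_weighted_svd(W, X, k, eps=1e-8):
    """Rank-k approximation of W (d x n) weighted by input activations.

    X has shape (n, N): one column per calibration token (assumed layout).
    Column j of W is scaled by the RMS activation of input channel j, so
    channels that carry more signal are reconstructed more accurately.
    """
    s = np.sqrt(np.mean(X**2, axis=1)) + eps   # per-channel RMS, shape (n,)
    return truncated_svd(W * s, k) / s         # scale columns, truncate, unscale

rng = np.random.default_rng(0)
d, n, N, k = 32, 48, 256, 8
W = rng.standard_normal((d, n))
# Synthetic activations with strongly uneven channel scales.
X = rng.standard_normal((n, N)) * rng.uniform(0.01, 10.0, size=(n, 1))

W_plain = truncated_svd(W, k)
W_aw = activation_weighted_svd(W, X, k)

s = np.sqrt(np.mean(X**2, axis=1)) + 1e-8
err_plain = np.linalg.norm((W - W_plain) * s)  # activation-weighted error
err_aw = np.linalg.norm((W - W_aw) * s)
print(f"weighted error, plain SVD:              {err_plain:.3f}")
print(f"weighted error, activation-weighted SVD: {err_aw:.3f}")
```

By the Eckart-Young theorem the activation-weighted approximation is optimal in this weighted norm among rank-$k$ matrices, so its weighted error cannot exceed that of plain SVD. Whether this matches the weighting used to obtain the reported PPL of 54.19 is not stated in the text.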

4. LoRA FFN Distillation

LoRA adapters ($r=8$) applied to FFN layers can close 99.9% of the PPL gap from the weight-norm-damaged baseline. However, this result says little on its own, because that baseline starts at 45× the uncompressed PPL. The real question, whether LoRA can close the gap from the activation-weighted baseline (1.99×), remains open. This is the critical missing experiment.
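The arithmetic behind these figures can be checked with a few lines. The paper does not define "gap closure", so this sketch assumes the usual definition, the fraction of the PPL gap between the damaged and uncompressed models that is removed. Under that assumption the implied post-LoRA PPL is a derived value, not a reported measurement; the function names are hypothetical.

```python
def gap_closure(ppl_damaged: float, ppl_recovered: float, ppl_reference: float) -> float:
    """Fraction of the (damaged - reference) PPL gap removed by recovery."""
    return (ppl_damaged - ppl_recovered) / (ppl_damaged - ppl_reference)

def ppl_after_closure(ppl_damaged: float, ppl_reference: float, closure: float) -> float:
    """PPL implied by closing a given fraction of the gap (assumed definition)."""
    return ppl_damaged - closure * (ppl_damaged - ppl_reference)

# Values reported in Section 3.3.
BASE, ACT_WEIGHTED, WEIGHT_NORM = 27.2, 54.19, 1230.0

print(f"weight-norm / activation-weighted: {WEIGHT_NORM / ACT_WEIGHTED:.1f}x")
print(f"activation-weighted / baseline:    {ACT_WEIGHTED / BASE:.2f}x")
print(f"weight-norm / baseline:            {WEIGHT_NORM / BASE:.0f}x")

# Implied (not measured) PPL after closing 99.9% of the weight-norm gap.
implied = ppl_after_closure(WEIGHT_NORM, BASE, 0.999)
print(f"implied PPL after 99.9% closure:   {implied:.2f}")
```

The gap from the weight-norm baseline is about 1203 PPL points, while the gap from the activation-weighted baseline is about 27, which is why a closure percentage measured on the former says little about the latter.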

5. Interpretation

What works: Per-cluster SVD improves local reconstruction. Activation-weighted SVD is dramatically better than weight-norm.

What doesn't work: Reconstruction error does not predict PPL. Weight norms are useless as importance proxies.

What remains: LoRA on the activation-weighted baseline must be tested. The combined attention + FFN byte savings (estimated 1.35–1.7×) need end-to-end validation.

References

  1. Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
  2. Geva, M. et al. Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP, 2021.