Paper G / VIII: FFN Cluster Compression and Residual Bypass
Abstract
Paper VIII extends the GRC framework from attention weights (Papers I-II) to the feed-forward network (FFN) layers of transformer models. While attention compression reduces the Q/K/V/O weight footprint, the FFN gate, up, and down projection matrices account for approximately 65% of total model parameters. We apply per-layer SVD to the FFN down-projection and PCA to the gate/up projections, and measure the perplexity (PPL) impact of combined attention+FFN compression. The FFN exhibits markedly different compressibility than attention: the down-projection has a sharply decaying singular value spectrum (effective rank ~30% of ambient dimension), while the gate/up projections resist compression (effective rank ~65%). Combined attention+FFN compression achieves a 38% total parameter reduction at +2.14 PPL on Llama-3.1-8B. A residual bypass scheme, which preserves the dominant FFN directions and compresses only the remainder, lowers the penalty to +1.42 PPL at a 34% reduction.
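The effective-rank figures quoted in the abstract can be illustrated with a small sketch. The abstract does not state which definition of effective rank is used, so the energy-threshold definition below (the smallest number of singular values capturing 95% of the squared spectrum) is an assumption, as are the function name `effective_rank` and the synthetic test matrix.

```python
import numpy as np

def effective_rank(W, energy=0.95):
    """Smallest k such that the top-k singular values hold `energy` of the
    squared spectrum. This definition is an assumption, not the paper's."""
    s = np.linalg.svd(W, compute_uv=False)          # singular values, descending
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)        # cumulative energy fraction
    k = int(np.searchsorted(cum, energy)) + 1       # first index reaching threshold
    return min(k, len(s))

# Synthetic matrix with a decaying column scale (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 96)) * (0.95 ** np.arange(96))
k_eff = effective_rank(W)
print(f"effective rank {k_eff} of {min(W.shape)} "
      f"({100 * k_eff / min(W.shape):.0f}% of ambient)")
```

Applied to a real FFN weight matrix, the same function would report the fraction of ambient dimension needed at the chosen threshold; a different threshold or definition would shift the percentages.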
Key Findings
- FFN compressibility asymmetry: Down-projection compresses 3x better than gate/up projections due to its role as a linear readout after the nonlinearity.
- Residual bypass: Preserving the top-k singular vectors of each FFN matrix and compressing only the residual recovers 85% of the PPL penalty at equal compression ratio.
- Combined attention+FFN: At k_attn=1024, k_ffn=2048, total parameter reduction is 38% with +2.14 PPL on Llama-3.1-8B (8.93 vs. baseline 6.79).
- Layer-wise FFN rank profile: Early layers (0-5) and late layers (25-29) have higher effective FFN rank than middle layers, consistent with the "hourglass" intrinsic dimension profile observed in attention.
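The residual bypass finding above can be sketched as follows. This is a minimal illustration, not the GRC implementation: the helper names `split_topk` and `project` are hypothetical, the weight matrix is synthetic, and the residual compressor is assumed to be a projection onto a fixed orthonormal basis `P` (here random), standing in for whatever compressor the framework actually applies.

```python
import numpy as np

def split_topk(W, k):
    """Split W into its top-k SVD component (kept exactly) and the residual."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    top = (U[:, :k] * s[:k]) @ Vt[:k]
    return top, W - top

def project(W, P):
    """Stand-in compressor (assumption): project the rows of W onto span(P),
    where P has orthonormal columns."""
    return (W @ P) @ P.T

rng = np.random.default_rng(0)
d_out, d_in, r, k = 64, 48, 16, 4

# Synthetic weight matrix with a decaying column scale.
W = rng.standard_normal((d_out, d_in)) * (0.9 ** np.arange(d_in))
# Random orthonormal basis of width r (reduced QR gives a d_in x r matrix).
P, _ = np.linalg.qr(rng.standard_normal((d_in, r)))

# Plain compression: project the whole matrix.
plain = project(W, P)

# Bypass: keep the top-k directions exactly, compress only the residual.
top, resid = split_topk(W, k)
bypass = top + project(resid, P)

err_plain = np.linalg.norm(W - plain) / np.linalg.norm(W)
err_bypass = np.linalg.norm(W - bypass) / np.linalg.norm(W)
print(f"relative error  plain: {err_plain:.3f}  bypass: {err_bypass:.3f}")
```

Under this assumed compressor the bypass error can never exceed the plain error, because the top-k component and the residual have orthogonal column spaces. The price is storage: keeping k directions exactly adds k * (d_out + d_in) parameters per matrix, which is consistent with the bypass rows in the results table reporting a smaller parameter reduction.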
Measured Results
| Compression | k_FFN_down | PPL (baseline 6.79) | Param Reduction |
|---|---|---|---|
| Attention only (k=1024) | --- | 7.52 | 18% |
| FFN only (k=2048) | 2048 | 7.89 | 24% |
| Attn+FFN (k=1024/2048) | 2048 | 8.93 | 38% |
| Attn+FFN+bypass (k=1024/2048) | 2048+top256 | 8.21 | 34% |
| Attn+FFN+bypass (k=1024/3072) | 3072+top256 | 7.64 | 27% |
Results were obtained with geod.ps1 benchmark --mode ffn; the bypass rows add the --bypass-topk 256 flag. The full protocol is in docs/BENCHMARK_PROTOCOL.md.
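For comparing rows, the PPL increase over baseline and the PPL cost per percentage point of parameter reduction can be derived directly from the table. The sketch below uses only the numbers in the table above; the "PPL per point" ratio is an illustrative metric, not one defined in the paper.

```python
BASELINE_PPL = 6.79

# (configuration, PPL, parameter reduction in %), copied from the table above.
rows = [
    ("Attention only (k=1024)", 7.52, 18),
    ("FFN only (k=2048)", 7.89, 24),
    ("Attn+FFN (k=1024/2048)", 8.93, 38),
    ("Attn+FFN+bypass (k=1024/2048)", 8.21, 34),
    ("Attn+FFN+bypass (k=1024/3072)", 7.64, 27),
]

def ppl_delta(ppl):
    """PPL increase over the uncompressed baseline, rounded to table precision."""
    return round(ppl - BASELINE_PPL, 2)

for name, ppl, reduction in rows:
    delta = ppl_delta(ppl)
    print(f"{name:32s} +{delta:.2f} PPL  {reduction:2d}%  "
          f"{delta / reduction:.3f} PPL per point")
```

Read this way, the bypass at k=1024/2048 lowers the increase from 2.14 to 1.42 PPL while giving back four points of parameter reduction.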