Paper G / VIII: FFN Cluster Compression and Residual Bypass

NagusameCS · April 2026 · Part of HyperTensor Papers I-X

3 matrices compressed (gate/up/down)

30 layers per-model FFN stack

Llama-3.1-8B primary test model

Abstract

Paper VIII extends the GRC framework from attention weights (Papers I-II) to the feed-forward network (FFN) layers of transformer models. Attention compression reduces the Q/K/V/O weight footprint, but the FFN gate, up, and down projection matrices account for approximately 65% of total model parameters. We apply per-layer SVD to the FFN down-projection and PCA to the gate/up projections, and measure the perplexity (PPL) impact of combined attention+FFN compression. The FFN exhibits markedly different compressibility than attention: the down-projection has a sharply decaying singular value spectrum (effective rank ~30% of ambient), while the gate/up projections are more resistant to compression (effective rank ~65%). On Llama-3.1-8B, combined attention+FFN compression achieves a 38% total parameter reduction with a 2.1 PPL increase (6.79 to 8.93). A residual bypass scheme, which preserves the dominant FFN directions and compresses only the remainder, lowers the increase to 1.4 PPL at a 34% parameter reduction.
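The abstract quotes effective rank as a fraction of ambient dimension but does not state the definition used. The sketch below assumes one common convention (the smallest k whose top-k singular values hold 95% of the squared spectral energy) and shows a plain rank-k SVD factorization. The function names `effective_rank` and `truncate_svd`, the 95% threshold, and the toy matrix are illustrative assumptions, not the paper's code.

```python
import numpy as np


def effective_rank(weight: np.ndarray, energy: float = 0.95) -> int:
    """Smallest k whose top-k singular values hold `energy` of the squared spectrum.

    Assumed definition: the source does not say how effective rank is computed.
    """
    s = np.linalg.svd(weight, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    # searchsorted returns the first index where cumulative >= energy
    return int(np.searchsorted(cumulative, energy) + 1)


def truncate_svd(weight: np.ndarray, k: int):
    """Return factors (a, b) with a @ b equal to the best rank-k approximation."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :k] * s[:k]   # (d_out, k), singular values folded into the left factor
    b = vt[:k, :]          # (k, d_in)
    return a, b


rng = np.random.default_rng(0)
# Toy stand-in for a projection matrix: exact rank 8 inside a 64 x 48 ambient shape.
w = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))

k_eff = effective_rank(w)
a, b = truncate_svd(w, 8)
print("effective rank:", k_eff, "of ambient", min(w.shape))
print("rank-8 reconstruction error:", np.linalg.norm(w - a @ b))
```

Storing the two factors costs k(d_in + d_out) values instead of d_in * d_out, so the factorization only saves parameters when k is well below the ambient dimension.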

Key Findings

FFN compressibility asymmetry: Down-projection compresses 3x better than gate/up projections due to its role as a linear readout after the nonlinearity.
Residual bypass: Preserving the top-k singular vectors of each FFN matrix and compressing only the residual recovers 85% of the PPL penalty at equal compression ratio.
Combined attention+FFN: At k_attn=1024, k_ffn=2048, total parameter reduction is 38% with +2.1 PPL on Llama-3.1-8B (6.79 to 8.93).
Layer-wise FFN rank profile: Early layers (0-5) and late layers (25-29) have higher effective FFN rank than middle layers, consistent with the "hourglass" intrinsic dimension profile observed in attention.
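The residual bypass finding can be illustrated with a small sketch. The source does not specify how the residual is compressed, so the code below uses projection onto a fixed orthonormal basis as a stand-in for the compression step; that basis, the toy spectrum, and the names `project` and `bypass_compress` are all assumptions made for illustration.

```python
import numpy as np


def orthonormal_basis(dim: int, k: int, rng) -> np.ndarray:
    """Random (dim, k) matrix with orthonormal columns (stand-in for a learned basis)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, k)))
    return q


def project(weight: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Compress by projecting the input side of `weight` onto span(basis)."""
    return (weight @ basis) @ basis.T


def bypass_compress(weight: np.ndarray, basis: np.ndarray, top_k: int) -> np.ndarray:
    """Keep the top_k singular directions exactly; compress only the residual."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    dominant = (u[:, :top_k] * s[:top_k]) @ vt[:top_k, :]
    residual = weight - dominant
    return dominant + project(residual, basis)


rng = np.random.default_rng(0)
# Toy weight with a decaying spectrum (hypothetical, not measured from any model).
u0, _ = np.linalg.qr(rng.standard_normal((96, 64)))
v0, _ = np.linalg.qr(rng.standard_normal((64, 64)))
spectrum = 1.0 / (1.0 + np.arange(64))
w = (u0 * spectrum) @ v0.T

basis = orthonormal_basis(64, 32, rng)
err_plain = np.linalg.norm(w - project(w, basis))
err_bypass = np.linalg.norm(w - bypass_compress(w, basis, top_k=8))
print("error without bypass:", err_plain)
print("error with bypass:   ", err_bypass)
```

In this setup the bypass can never increase the Frobenius error, because the dominant part and the residual have orthogonal left singular vectors and their projection errors therefore add in squares. The bypass does store extra factors for the preserved directions, so it trades some parameter savings for accuracy.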

Measured Results

Compression	k_FFN_down	PPL (baseline 6.79)	Param Reduction
Attention only (k=1024)	---	7.52	18%
FFN only (k=2048)	2048	7.89	24%
Attn+FFN (k=1024/2048)	2048	8.93	38%
Attn+FFN+bypass (k=1024/2048)	2048+top256	8.21	34%
Attn+FFN+bypass (k=1024/3072)	3072+top256	7.64	27%
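For reading the table, the PPL increase over the 6.79 baseline and the increase per percentage point of parameter reduction can be derived directly from the rows above. The short script below only restates the table values; the per-point ratio is an illustrative summary statistic, not a metric reported by the paper.

```python
BASELINE_PPL = 6.79

# (configuration, PPL, parameter reduction in percent), copied from the table
rows = [
    ("Attention only (k=1024)", 7.52, 18),
    ("FFN only (k=2048)", 7.89, 24),
    ("Attn+FFN (k=1024/2048)", 8.93, 38),
    ("Attn+FFN+bypass (k=1024/2048)", 8.21, 34),
    ("Attn+FFN+bypass (k=1024/3072)", 7.64, 27),
]


def ppl_delta(ppl: float, baseline: float = BASELINE_PPL) -> float:
    """PPL increase over the uncompressed baseline, rounded to table precision."""
    return round(ppl - baseline, 2)


deltas = {name: ppl_delta(ppl) for name, ppl, _ in rows}

for name, ppl, reduction in rows:
    delta = deltas[name]
    per_point = delta / reduction  # PPL increase per percentage point removed
    print(f"{name:32s} +{delta:.2f} PPL  {reduction:2d}%  {per_point:.3f} PPL/pt")
```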

Reproduction: The benchmark is run with geod.ps1 benchmark --mode ffn and the --bypass-topk 256 flag. The full protocol is in docs/BENCHMARK_PROTOCOL.md.
