Paper G / VIII: FFN Cluster Compression and Residual Bypass

NagusameCS · April 2026 · Part of HyperTensor Papers I-X
3 matrices compressed (gate/up/down)
30 layers per-model FFN stack
Llama-3.1-8B primary test model

Abstract

Paper VIII extends the GRC framework from attention weights (Papers I-II) to the feed-forward network (FFN) layers of transformer models. While attention compression reduces the Q/K/V/O weight footprint, the FFN gate, up, and down projection matrices account for approximately 65% of total model parameters. We apply per-layer SVD to the FFN down-projection and PCA to the gate/up projections, and measure the perplexity impact of combined attention+FFN compression. The FFN exhibits markedly different compressibility than attention: the down-projection has a sharply decaying singular value spectrum (effective rank ~30% of ambient), while the gate/up projections are more resistant to compression (effective rank ~65%). On Llama-3.1-8B, combined attention+FFN compression reaches a 38% total parameter reduction at a 2.14 PPL increase (6.79 to 8.93). A residual bypass scheme, which preserves the dominant FFN directions while compressing the remainder, lowers the increase to 1.42 PPL (8.21) at a 34% parameter reduction.
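To make the per-layer SVD step and the "effective rank" figures concrete, here is a minimal NumPy sketch on a toy matrix. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not define effective rank, so the 95% spectral-energy threshold below is an assumption, and the function names (effective_rank, truncated_svd, factored_param_count) are hypothetical.

```python
import numpy as np

def effective_rank(singular_values, energy=0.95):
    """Smallest k whose top-k singular values hold `energy` of the squared spectrum.

    The 95% energy threshold is an assumption; the paper may use another definition.
    """
    s2 = np.asarray(singular_values, dtype=np.float64) ** 2
    cumulative = np.cumsum(s2) / s2.sum()
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s2))

def truncated_svd(weight, k):
    """Return (a, b, s) where a @ b is the best rank-k approximation of weight."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :k] * s[:k]  # shape (m, k), singular values folded into the left factor
    b = vt[:k, :]         # shape (k, n)
    return a, b, s

def factored_param_count(m, n, k):
    """Values stored by a rank-k factorization of an m x n matrix."""
    return k * (m + n)

# Toy stand-in for a down-projection: low-rank signal plus small noise.
rng = np.random.default_rng(0)
m, n, true_rank = 96, 64, 8
weight = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
weight += 0.01 * rng.standard_normal((m, n))

a, b, s = truncated_svd(weight, k=true_rank)
rel_err = np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)

print("effective rank:", effective_rank(s), "of", min(m, n))
print("relative reconstruction error:", rel_err)
print("stored values:", m * n, "->", factored_param_count(m, n, true_rank))
```

A rank-k factorization only saves storage when k(m + n) < mn, so a sharply decaying spectrum, as reported for the down-projection, is what makes a small k viable.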

Key Findings

Measured Results

Compression | k_FFN_down | PPL (baseline 6.79) | Param Reduction
Attention only (k=1024) | --- | 7.52 | 18%
FFN only (k=2048) | 2048 | 7.89 | 24%
Attn+FFN (k=1024/2048) | 2048 | 8.93 | 38%
Attn+FFN+bypass (k=1024/2048) | 2048+top256 | 8.21 | 34%
Attn+FFN+bypass (k=1024/3072) | 3072+top256 | 7.64 | 27%
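The trade-off in these rows is easier to compare as a perplexity increase over the 6.79 baseline. The short script below only does arithmetic on the reported numbers. The "PPL per point of reduction" ratio is a convenience metric introduced here for illustration, not one the paper defines, and the helper name ppl_deltas is hypothetical.

```python
BASELINE_PPL = 6.79

# (configuration, measured PPL, total parameter reduction in percent), from the table.
ROWS = [
    ("Attention only (k=1024)", 7.52, 18),
    ("FFN only (k=2048)", 7.89, 24),
    ("Attn+FFN (k=1024/2048)", 8.93, 38),
    ("Attn+FFN+bypass (k=1024/2048)", 8.21, 34),
    ("Attn+FFN+bypass (k=1024/3072)", 7.64, 27),
]

def ppl_deltas(rows, baseline=BASELINE_PPL):
    """Map each configuration to its PPL increase over the baseline."""
    return {name: ppl - baseline for name, ppl, _ in rows}

deltas = ppl_deltas(ROWS)
for name, ppl, reduction in ROWS:
    delta = deltas[name]
    # Illustrative ratio: PPL increase paid per percentage point of parameters removed.
    print(f"{name}: +{delta:.2f} PPL, {delta / reduction:.4f} PPL per point")
```

Read this way, adding the bypass at k=1024/2048 gives back 4 points of parameter reduction (38% to 34%) and recovers 0.72 PPL (8.93 to 8.21).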
Reproduction: Benchmark uses geod.ps1 benchmark --mode ffn with the --bypass-topk 256 flag. Full protocol in docs/BENCHMARK_PROTOCOL.md.