Paper 07 · May 2026

Structure-Aware FFN Compression via Column Clustering

William Ken Ohara Stewart

HyperTensor Project · Extended version · TeX source

Abstract

GRC (Paper I) compresses only 3 of the 7 weight matrices, while the FFN accounts for ~65% of the bytes. We propose per-cluster SVD on FFN columns: cluster the columns by activation pattern, then apply SVD within each cluster. At k=0.25n, 4-cluster compression improves reconstruction error by 22.6% relative to global SVD. However, reconstruction error does not predict PPL, which is a critical proxy failure. Activation-weighted SVD is 22.7× better in PPL than weight-norm SVD (54.19 vs. 1230), and column weight norms are uncorrelated with functional importance. LoRA FFN distillation closes 99.9% of the gap from the damaged baseline, but the real question, what LoRA achieves on the activation-weighted baseline, remains open.
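The per-cluster procedure described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the clustering method (plain k-means), the per-column feature array `col_features`, and the rule that splits the rank budget in proportion to cluster size are all hypothetical choices, and the random data in the usage lines does not reproduce the reported 22.6% figure.

```python
import numpy as np

def truncated_svd(W, rank):
    """Best rank-`rank` approximation of W in Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def kmeans_labels(X, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means over the rows of X; returns one label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def per_cluster_svd(W, col_features, n_clusters, total_rank):
    """Cluster the columns of W by `col_features`, then truncate an SVD per cluster.

    Assumption: each cluster gets a share of `total_rank` proportional to
    the number of columns it holds (at least rank 1).
    """
    n_cols = W.shape[1]
    labels = kmeans_labels(col_features, n_clusters)
    W_hat = np.zeros_like(W)
    for c in range(n_clusters):
        cols = np.flatnonzero(labels == c)
        if cols.size == 0:
            continue  # k-means can leave a cluster empty
        rank = max(1, round(total_rank * cols.size / n_cols))
        rank = min(rank, cols.size, W.shape[0])
        W_hat[:, cols] = truncated_svd(W[:, cols], rank)
    return W_hat, labels

# Usage on synthetic data (stand-ins for a real FFN matrix and activations).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
acts = np.abs(rng.standard_normal((128, 32)))  # one activation profile per column
W_hat, labels = per_cluster_svd(W, acts, n_clusters=4, total_rank=32)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
global_err = np.linalg.norm(W - truncated_svd(W, 32)) / np.linalg.norm(W)
```

Comparing `rel_err` with `global_err` on real weights and calibration activations would be the analogue of the error comparison reported here; on unstructured random data no particular gain should be expected.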

1. Key Results

Metric                          Value
4-cluster error improvement     +22.6% at k=0.25n
Activation-weighted SVD (PPL)   54.19 (1.99× baseline)
Weight-norm SVD (PPL)           1230 (45× baseline)
LoRA gap closure                99.9% (from damaged baseline)
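The multipliers in the table can be cross-checked with a few lines of arithmetic. The baseline perplexity is not stated here; the values below are only what the reported ratios imply, assuming both multipliers are rounded.

```python
ppl_activation_weighted = 54.19   # from the table
ppl_weight_norm = 1230.0          # from the table

# Ratio between the two compressed models (the abstract reports 22.7x).
ratio = ppl_weight_norm / ppl_activation_weighted

# Baseline PPL implied by each multiplier; an inference, not a reported number.
implied_baseline_a = ppl_activation_weighted / 1.99
implied_baseline_b = ppl_weight_norm / 45.0

print(f"weight-norm / activation-weighted: {ratio:.1f}x")
print(f"implied baseline PPL: {implied_baseline_a:.1f} vs {implied_baseline_b:.1f}")
```

Both estimates of the baseline land near 27, so the two multipliers and the 22.7× ratio are mutually consistent.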

2. Critical Finding

Reconstruction error does not predict PPL for FFN matrices. Column weight norms are uncorrelated with activation importance, so the weight-norm proxy is falsified. The key missing experiment is to apply LoRA to the activation-weighted baseline (PPL 54.19, 1.99× baseline) and test whether the gap can be closed to below 1.30×.
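The difference between the two weightings can be illustrated with a small synthetic sketch. It assumes one common formulation of activation-weighted SVD (scale each input column by its RMS activation, truncate, then unscale) and reads "weight-norm SVD" as the same procedure with column norms as the weights; both readings are assumptions, as are all names below. The synthetic inputs are built so that activation energy is independent of the weights by construction, so the example shows the mechanism rather than reproducing the PPL numbers.

```python
import numpy as np

def weighted_svd(W, col_weights, rank, eps=1e-8):
    """Rank-`rank` W_hat minimising ||(W - W_hat) diag(col_weights)||_F."""
    s = np.maximum(col_weights, eps)           # keep the scaling invertible
    U, sig, Vt = np.linalg.svd(W * s, full_matrices=False)
    return ((U[:, :rank] * sig[:rank]) @ Vt[:rank]) / s

def output_error(W, W_hat, X):
    """Relative error of layer outputs on inputs X of shape (n_tokens, in_dim)."""
    Y = X @ W.T
    return np.linalg.norm(Y - X @ W_hat.T) / np.linalg.norm(Y)

rng = np.random.default_rng(0)
out_dim, in_dim, n_tokens, rank = 48, 96, 512, 24
W = rng.standard_normal((out_dim, in_dim))

# Synthetic inputs: a few high-energy channels, chosen independently of W.
scale = np.where(np.arange(in_dim) < 8, 10.0, 0.5)
X = rng.standard_normal((n_tokens, in_dim)) * scale

act_rms = np.sqrt((X ** 2).mean(axis=0))       # activation-based column weights
col_norm = np.linalg.norm(W, axis=0)           # weight-norm column weights

err_act = output_error(W, weighted_svd(W, act_rms, rank), X)
err_norm = output_error(W, weighted_svd(W, col_norm, rank), X)
corr = np.corrcoef(act_rms, col_norm)[0, 1]    # proxy vs. activation energy

print(f"output error, activation-weighted: {err_act:.3f}")
print(f"output error, weight-norm:         {err_norm:.3f}")
print(f"corr(activation RMS, column norm): {corr:.3f}")
```

When the column norms carry no information about which input channels are active, the weight-norm weighting spends rank on the wrong columns, and the output error (the quantity that feeds into PPL) suffers even if the plain reconstruction error looks acceptable.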

References

  1. Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
  2. Geva, M. et al. Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP 2021.