Reproduce Paper G: FFN Compression, HyperTensor Research

Scope

Reproduces the FFN down-projection SVD compression results by measuring the perplexity (PPL) degradation when the FFN down matrix (W_down) is replaced by its truncated SVD at ranks r=256, 512, 1024, 2048. Validates the finding that FFN down is the most compressible component of the transformer block (less than 2% PPL increase at r=2048, see Step 3).
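
For orientation, the truncated SVD step can be sketched as below. This is a minimal illustration in NumPy on a random toy matrix, not the code in the repository scripts; the function names truncated_svd and relative_error are hypothetical, and the 64x256 matrix only stands in for a real W_down.

```python
import numpy as np

def truncated_svd(W, r):
    """Return factors (A, B) such that A @ B is the best rank-r approximation of W."""
    # full_matrices=False gives the thin SVD: U is (m, k), Vt is (k, n), k = min(m, n)
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # (m, r): left singular vectors scaled by singular values
    B = Vt[:r, :]          # (r, n): right singular vectors
    return A, B

def relative_error(W, A, B):
    """Relative Frobenius-norm reconstruction error of the factored matrix."""
    return np.linalg.norm(W - A @ B) / np.linalg.norm(W)

# Toy stand-in for W_down (assumption: real matrices come from the extraction step)
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 256))
A, B = truncated_svd(W, 16)
err = relative_error(W, A, B)
print(f"rank 16 relative error: {err:.3f}")
```

Reconstruction error is only a proxy. The results this guide reproduces are stated in terms of PPL, which requires running the model with the compressed matrix in place.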

Hardware target

CPU-only run: ~60 seconds for a 7B model on any machine with 32 GB RAM.
GPU validation: RTX 4070 Laptop (8 GB) or L40S.

Prerequisites

Python 3.10+, PyTorch 2.x, transformers, safetensors.
GGUF model file (Q4_K_M or fp16).
scripts/check_q4_layout.py for matrix extraction.

Step 1: Extract and compress FFN down

python scripts/check_q4_layout.py --model ../models/qwen2.5-7b-q4_k_m.gguf \
    --extract-ffn-down --output ../benchmarks/ffn_down_svd.json

Step 2: Run the SVD compression sweep

python scripts/ffn_svd_sweep.py --input ../benchmarks/ffn_down_svd.json \
    --ranks 256,512,1024,2048 --output ../benchmarks/ffn_svd_results.csv
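
To get a feel for what each rank in the sweep means for storage, the sketch below computes the parameter count of the two rank-r factors relative to the dense matrix. The dimensions d_model=4096 and d_ff=11008 are illustrative assumptions, not values read from the model file; substitute the actual shape of W_down. The helper factored_param_ratio is hypothetical and not part of the repository scripts.

```python
def factored_param_ratio(m, n, r):
    """Parameters of an (m x r) and an (r x n) factor relative to a dense (m x n) matrix."""
    return r * (m + n) / (m * n)

# Illustrative dimensions only (assumption): replace with the real W_down shape
d_model, d_ff = 4096, 11008

# Below this rank the factored form stores fewer parameters than the dense matrix
break_even_rank = (d_model * d_ff) / (d_model + d_ff)

ratios = {r: factored_param_ratio(d_model, d_ff, r) for r in (256, 512, 1024, 2048)}
for r, ratio in ratios.items():
    print(f"r={r}: {ratio:.1%} of dense parameters")
print(f"break-even rank: {break_even_rank:.0f}")
```

The ratio is linear in r, so halving the rank halves the stored parameters. A rank above the break-even value would make the factored form larger than the original.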

Step 3: Expected output

r=2048 (50%): PPL increase <2%.
r=1024 (25%): PPL increase ~5-8%.
r=512 (12.5%): PPL increase ~15-20%.
Singular value decay follows a power law with exponent α≈0.7, confirming strong compressibility.
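
One way to check the decay exponent is a straight-line fit in log-log space. The sketch below assumes the power law has the form sigma_i proportional to i^(-alpha), which is an interpretation of the statement above rather than something the source spells out. The function fit_power_law_exponent is hypothetical, and the input here is synthetic.

```python
import numpy as np

def fit_power_law_exponent(singular_values):
    """Estimate alpha in sigma_i ~ C * i**(-alpha) by least squares on log-log axes."""
    s = np.asarray(singular_values, dtype=float)  # must be strictly positive
    idx = np.arange(1, len(s) + 1)                # 1-based index of each singular value
    slope, _intercept = np.polyfit(np.log(idx), np.log(s), 1)
    return -slope

# Synthetic spectrum with a known exponent (assumption: real spectra come from the sweep)
sigma = np.arange(1, 1001, dtype=float) ** -0.7
alpha = fit_power_law_exponent(sigma)
print(f"fitted alpha: {alpha:.3f}")
```

Real spectra rarely follow a single power law over the whole index range, so the fitted value may depend on which part of the spectrum is included in the fit.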

Validation

Compare the singular value spectra against Figure 2 of the paper. For every layer, the cumulative SVD energy curve should reach 90% within the first 25% of singular values.
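
The energy check can be sketched as follows. It assumes "energy" means the cumulative sum of squared singular values normalized by their total, which is the common convention but is not defined in this guide. The names energy_curve and fraction_needed are hypothetical.

```python
import numpy as np

def energy_curve(singular_values):
    """Cumulative fraction of squared singular values (input sorted in descending order)."""
    s2 = np.asarray(singular_values, dtype=float) ** 2
    return np.cumsum(s2) / s2.sum()

def fraction_needed(singular_values, target=0.90):
    """Fraction of singular values needed for the energy curve to reach `target`."""
    curve = energy_curve(singular_values)
    # searchsorted returns the first index whose cumulative energy is >= target
    k = min(int(np.searchsorted(curve, target)) + 1, len(curve))
    return k / len(curve)

# Toy spectrum: squared values are [9, 1, 1, 1], so three of four values are needed
print(fraction_needed([3.0, 1.0, 1.0, 1.0]))
```

Under this definition, a layer passes the check above when fraction_needed(sigma, 0.90) is at most 0.25 for its singular values sigma.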