Scope
Reproduces the FFN down-projection SVD compression results by measuring perplexity (PPL) degradation when the FFN down matrix (W_down) is compressed via truncated SVD at ranks r=256, 512, 1024, and 2048. Validates the finding that FFN down is the most compressible component of the transformer block (less than 2% PPL increase at r=2048; see Step 3).
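For orientation, truncated SVD replaces W_down with two thin factors of rank r. The following is a minimal sketch, not the repository's implementation: it uses NumPy for self-containment, a random toy matrix stands in for a real W_down, and the function name `truncated_svd` is hypothetical.

```python
import numpy as np

def truncated_svd(w: np.ndarray, rank: int):
    """Return factors (a, b) such that a @ b is the best rank-`rank`
    approximation of w in the Frobenius norm (Eckart-Young)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # (out_dim, rank), singular values folded in
    b = vt[:rank, :]            # (rank, in_dim)
    return a, b

# Toy stand-in for W_down; a real matrix would come from the model file.
rng = np.random.default_rng(0)
w_down = rng.standard_normal((64, 256))

a, b = truncated_svd(w_down, rank=16)

# Relative Frobenius reconstruction error of the rank-16 approximation.
rel_err = np.linalg.norm(w_down - a @ b) / np.linalg.norm(w_down)

# Storage cost of the two factors relative to the dense matrix:
# rank * (out_dim + in_dim) / (out_dim * in_dim).
params_ratio = (a.size + b.size) / w_down.size
print(f"rel_err={rel_err:.4f} params_ratio={params_ratio:.4f}")
```

Note that the parameter ratio depends on both matrix dimensions, so it is generally not the same number as the rank fraction r/d.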
Hardware target
- CPU-only run: ~60 seconds for a 7B model on any machine with 32GB RAM.
- GPU validation: RTX 4070 Laptop 8GB or L40S.
Prerequisites
- Python 3.10+, PyTorch 2.x, transformers, safetensors.
- GGUF model file (Q4_K_M or fp16).
- scripts/check_q4_layout.py for matrix extraction.
Step 1: Extract and compress FFN down
python scripts/check_q4_layout.py --model ../models/qwen2.5-7b-q4_k_m.gguf \
--extract-ffn-down --output ../benchmarks/ffn_down_svd.json
Step 2: Run the SVD compression sweep
python scripts/ffn_svd_sweep.py --input ../benchmarks/ffn_down_svd.json \
--ranks 256,512,1024,2048 --output ../benchmarks/ffn_svd_results.csv
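The internals of ffn_svd_sweep.py are not shown here, so the following is only a rough sketch of what a rank sweep can look like. It reports relative Frobenius reconstruction error as a cheap proxy, which is not the same thing as the PPL measurement the script performs. The function name, the CSV column names, and the toy matrix and ranks are all assumptions chosen for illustration.

```python
import csv
import io
import numpy as np

def sweep_ranks(w: np.ndarray, ranks):
    """For each rank, compute the relative Frobenius error of the best
    rank-r approximation directly from the discarded singular values."""
    s = np.linalg.svd(w, compute_uv=False)  # singular values, descending
    total = np.sum(s ** 2)
    rows = []
    for r in ranks:
        r_eff = min(r, len(s))          # rank cannot exceed full rank
        tail = np.sum(s[r_eff:] ** 2)   # energy dropped by truncation
        rows.append({
            "rank": r,
            "rank_fraction": r_eff / len(s),
            "rel_fro_error": float(np.sqrt(tail / total)),
        })
    return rows

# Toy matrix and toy ranks; a real sweep would use the extracted W_down
# and the ranks passed via --ranks.
rng = np.random.default_rng(0)
w = rng.standard_normal((128, 512))
rows = sweep_ranks(w, ranks=(8, 16, 32, 64))

# Write the results as CSV (to an in-memory buffer in this sketch).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["rank", "rank_fraction", "rel_fro_error"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

Computing the SVD once and reading each rank's error off the tail of the spectrum avoids a separate decomposition per rank.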
Step 3: Expected output
- r=2048 (50%): PPL increase <2%.
- r=1024 (25%): PPL increase ~5-8%.
- r=512 (12.5%): PPL increase ~15-20%.
- Singular value decay follows power law α≈0.7, confirming strong compressibility.
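One way to check the decay exponent is a least-squares line fit in log-log space. This sketch assumes the convention that the i-th singular value scales as i^(-α), which the text above does not spell out, and it runs on a synthetic spectrum rather than real model weights. The function name `fit_power_law_exponent` is hypothetical.

```python
import numpy as np

def fit_power_law_exponent(s: np.ndarray) -> float:
    """Estimate alpha in s_i ~ C * i**(-alpha) by fitting a straight line
    to log(s_i) versus log(i). Assumes all singular values are positive."""
    idx = np.arange(1, len(s) + 1)
    slope, _intercept = np.polyfit(np.log(idx), np.log(s), 1)
    return float(-slope)

# Synthetic spectrum with a known exponent, used only to exercise the fit.
idx = np.arange(1, 1025)
s_synth = 5.0 * idx ** -0.7

alpha = fit_power_law_exponent(s_synth)
print(f"alpha={alpha:.3f}")
```

On real spectra the head and tail often deviate from a clean power law, so restricting the fit to a middle range of indices may give a more stable estimate.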
Validation
Compare singular value spectra against the paper's Figure 2. The SVD energy curve should show that 90% of energy is captured in the first 25% of singular values for all layers.
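The energy check can be sketched as follows, assuming "energy" means the cumulative sum of squared singular values normalized by their total; that definition is an assumption, since it is not stated above. The function name and the small example spectrum are illustrative only.

```python
import numpy as np

def fraction_for_energy(s: np.ndarray, target: float = 0.9) -> float:
    """Return the fraction of singular values (largest first) needed to
    reach `target` of the total squared-singular-value energy."""
    s = np.sort(s)[::-1]                        # ensure descending order
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, target)) + 1  # smallest k with energy >= target
    return k / len(s)

# Hand-checkable example: squares are 9, 4, 1, 1 (total 15), so the
# cumulative energy is 0.6, 0.867, 0.933, 1.0 and three values are needed.
frac = fraction_for_energy(np.array([3.0, 2.0, 1.0, 1.0]), target=0.9)
print(frac)  # 0.75
```

For the validation described above, this function would be applied to each layer's spectrum, with the result expected to come out at or below 0.25.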