Paper F / VII: Task-Level Impact Analysis of Low-Rank Compression
Abstract
Low-rank compression of attention weights reduces the parameter count of a transformer, but its impact on downstream task performance is not uniform. Paper VII measures the per-task degradation curve for eight common benchmarks (MMLU, HellaSwag, ARC-Challenge, PIQA, WinoGrande, GSM8K, TruthfulQA, OpenBookQA) under GRC compression at k = 256, 512, 768, 1024, 1536, and 2048 on Llama-3.1-8B-Instruct. We find that reasoning-heavy tasks (GSM8K, ARC-Challenge) degrade faster than fact-retrieval tasks (MMLU, TruthfulQA) as the rank k decreases, and that the degradation follows a power law in k with task-specific exponents. One task, PIQA, shows no significant degradation even at k = 256 (79.2%, versus 80.1% at k = 2048), suggesting that certain capabilities are robust to compression because they depend on shallow, low-rank features. We provide a task-level impact matrix that lets practitioners choose a rank based on their accuracy budget for each downstream task.
Key Findings
- Power-law degradation: Task accuracy follows `acc(k) = acc(inf) - c * k^{-alpha}` with task-specific alpha (0.3 to 1.1).
- PIQA immunity: PIQA accuracy is within 0.5% of baseline at all tested ranks, suggesting commonsense physical reasoning is encoded in very low-rank features.
- GSM8K vulnerability: Mathematical reasoning degrades fastest (alpha ≈ 1.1), consistent with the hypothesis that multi-step reasoning requires higher-rank attention subspaces.
- Task impact matrix: A lookup table mapping (task, rank) to expected accuracy, usable as a design tool for deployment trade-offs.
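
The power-law form in the first finding can be fitted with ordinary least squares once it is linearized: rearranging gives log(acc(inf) - acc(k)) = log(c) - alpha * log(k), a straight line in log(k). The sketch below illustrates that idea and is not the paper's fitting procedure. The function names `fit_power_law` and `predict` are hypothetical, and `acc_inf` (the asymptotic accuracy) is assumed to be supplied by the caller. The check at the end uses synthetic values, not measured data.

```python
import math


def predict(k, acc_inf, c, alpha):
    """Evaluate acc(k) = acc_inf - c * k^(-alpha)."""
    return acc_inf - c * k ** (-alpha)


def fit_power_law(ks, accs, acc_inf):
    """Fit c and alpha by least squares in log-log space.

    Uses log(acc_inf - acc(k)) = log(c) - alpha * log(k).
    acc_inf is an assumed input; it must exceed every measured accuracy.
    Returns (c, alpha).
    """
    xs, ys = [], []
    for k, acc in zip(ks, accs):
        gap = acc_inf - acc
        if gap <= 0:
            raise ValueError("acc_inf must exceed every measured accuracy")
        xs.append(math.log(k))
        ys.append(math.log(gap))

    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

    slope = sxy / sxx                      # equals -alpha
    intercept = y_mean - slope * x_mean    # equals log(c)
    return math.exp(intercept), -slope


# Synthetic sanity check (illustrative values only, not results from the paper).
ks = [2048, 1536, 1024, 768, 512, 256]
true_acc_inf, true_c, true_alpha = 70.0, 900.0, 0.8
synthetic = [predict(k, true_acc_inf, true_c, true_alpha) for k in ks]
c_hat, alpha_hat = fit_power_law(ks, synthetic, true_acc_inf)
```

Because the fit depends on the assumed `acc_inf`, one reasonable extension is to scan a range of candidate values and keep the one with the smallest residual. On real rows the log transform also reweights the errors, so a nonlinear least-squares fit may be preferable.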
Measured Results
| Task | k=2048 | k=1536 | k=1024 | k=768 | k=512 | k=256 |
|---|---|---|---|---|---|---|
| MMLU | 65.2% | 64.8% | 63.1% | 61.4% | 58.2% | 52.1% |
| HellaSwag | 78.9% | 78.2% | 76.5% | 74.1% | 70.3% | 64.8% |
| ARC-Challenge | 54.3% | 53.1% | 50.2% | 47.5% | 42.8% | 36.1% |
| PIQA | 80.1% | 80.0% | 79.8% | 79.7% | 79.5% | 79.2% |
| WinoGrande | 73.4% | 72.8% | 71.2% | 68.9% | 64.5% | 58.3% |
| GSM8K | 42.1% | 38.5% | 31.2% | 25.8% | 18.4% | 9.7% |
| TruthfulQA | 54.8% | 53.9% | 51.7% | 49.2% | 45.1% | 39.6% |
| OpenBookQA | 44.2% | 43.1% | 40.8% | 38.0% | 33.5% | 27.4% |
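
As a rough illustration of how the table above could serve as a lookup tool, the sketch below encodes the measured values and returns the smallest tested rank that meets a set of per-task accuracy floors. The names `RANKS`, `IMPACT`, `accuracy`, and `smallest_rank` are hypothetical, and expressing the budget as a minimum acceptable accuracy per task is an assumption about how a practitioner might state it.

```python
# Tested ranks, in the same column order as the results table.
RANKS = [2048, 1536, 1024, 768, 512, 256]

# Accuracy (%) per task at each rank, copied from the results table.
IMPACT = {
    "MMLU":          [65.2, 64.8, 63.1, 61.4, 58.2, 52.1],
    "HellaSwag":     [78.9, 78.2, 76.5, 74.1, 70.3, 64.8],
    "ARC-Challenge": [54.3, 53.1, 50.2, 47.5, 42.8, 36.1],
    "PIQA":          [80.1, 80.0, 79.8, 79.7, 79.5, 79.2],
    "WinoGrande":    [73.4, 72.8, 71.2, 68.9, 64.5, 58.3],
    "GSM8K":         [42.1, 38.5, 31.2, 25.8, 18.4, 9.7],
    "TruthfulQA":    [54.8, 53.9, 51.7, 49.2, 45.1, 39.6],
    "OpenBookQA":    [44.2, 43.1, 40.8, 38.0, 33.5, 27.4],
}


def accuracy(task, k):
    """Look up the measured accuracy (%) for a task at a tested rank k."""
    return IMPACT[task][RANKS.index(k)]


def smallest_rank(floors):
    """Return the smallest tested rank meeting every per-task accuracy floor.

    floors maps task name -> minimum acceptable accuracy (%).
    Returns None if no tested rank satisfies all floors.
    """
    feasible = [
        k for k in RANKS
        if all(accuracy(task, k) >= floor for task, floor in floors.items())
    ]
    return min(feasible) if feasible else None


# Example: require MMLU >= 60% and PIQA >= 79%.
print(smallest_rank({"MMLU": 60.0, "PIQA": 79.0}))  # 768
```

The lookup covers only the six tested ranks. Estimating accuracy at an intermediate rank would require interpolation or the fitted power law, which this sketch does not attempt.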