Abstract
Paper I demonstrated that GRC attention compression at $k=1024$ achieves 106.27% of baseline throughput on Llama-3.1-8B. This paper asks what that compression costs in task quality, measuring GRC's impact across five standard benchmarks: ARC-Challenge (factual reasoning), HellaSwag (commonsense), MMLU (multidisciplinary knowledge), GSM8K (math), and PIQA (physical reasoning). At $k=1024$, factual recall is near-baseline (ARC: 78.4% unchanged; HellaSwag: 79.8→79.2%), MMLU shows moderate degradation (64.9→62.4%), and math reasoning degrades the most (GSM8K: 49.7→41.5%). At $k=1536$, every task is within 1.2pp of baseline except GSM8K (−2.1pp). The results support the UGT zone-specialisation hypothesis: attention compression primarily affects construction-based reasoning (math, code) while discovery-based factual knowledge remains robust.
1. Experimental Setup
Model: Llama-3.1-8B-Instruct Q4_K_M. Compression: GRC attention-only (Paper I protocol). Ranks tested: $k \in \{1024, 1536, 2048, 4096\}$ (4096 = uncompressed baseline). Benchmarks: ARC-Challenge (25-shot), HellaSwag (10-shot), MMLU (5-shot), GSM8K (5-shot), PIQA (0-shot). Evaluation: lm-evaluation-harness v0.4.4.
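The per-benchmark few-shot settings above can be collected into one small driver. The following is a minimal sketch that only builds `lm_eval` command lines and does not run them. The backend (`hf`), the checkpoint string, and the `results/` output directory are placeholder assumptions rather than the configuration used in this paper: the Q4_K_M quantised model and the GRC compression step would need their own backend and arguments, which are not specified here.

```python
import shlex

# Few-shot counts per benchmark, as listed in the experimental setup.
# Keys are the lm-evaluation-harness task names for the five benchmarks.
FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "gsm8k": 5,
    "piqa": 0,
}

# Assumption: placeholder backend and checkpoint, NOT the paper's setup.
# The quantised, GRC-compressed model would need different values here.
MODEL_BACKEND = "hf"
MODEL_ARGS = "pretrained=meta-llama/Llama-3.1-8B-Instruct"


def build_commands(fewshot=FEWSHOT, backend=MODEL_BACKEND, model_args=MODEL_ARGS):
    """Return one lm_eval argv list per benchmark."""
    commands = []
    for task, shots in fewshot.items():
        commands.append([
            "lm_eval",
            "--model", backend,
            "--model_args", model_args,
            "--tasks", task,
            "--num_fewshot", str(shots),
            "--output_path", f"results/{task}",  # hypothetical output location
        ])
    return commands


if __name__ == "__main__":
    # Print shell-quoted commands for inspection; nothing is executed.
    for argv in build_commands():
        print(shlex.join(argv))
```

Running each benchmark as a separate invocation keeps the few-shot count explicit per task, at the cost of reloading the model five times per rank.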
2. Results: Full Benchmark Sweep
| Benchmark | $k=1024$ | $k=1536$ | $k=2048$ | Baseline ($d=4096$) |
|---|---|---|---|---|
| ARC-Challenge | 78.4% | 78.5% | 78.4% | 78.4% |
| HellaSwag | 79.2% | 79.6% | 79.7% | 79.8% |
| MMLU | 62.4% | 63.7% | 64.5% | 64.9% |
| GSM8K | 41.5% | 47.6% | 49.1% | 49.7% |
| PIQA | 78.8% | 79.0% | 79.2% | 79.3% |
Key Findings
Factual recall is robust: ARC-Challenge shows zero degradation at any compression level. At $k=1024$, HellaSwag loses only 0.6pp and PIQA only 0.5pp. These tasks rely on stored factual/commonsense knowledge that attention compression preserves.
Math reasoning degrades: GSM8K drops 8.2pp at $k=1024$, recovering to within 2.1pp at $k=1536$. Math requires multi-step reasoning chains where attention quality matters more.
MMLU is intermediate: 2.5pp drop at $k=1024$, narrowing to 1.2pp at $k=1536$. MMLU's mixed fact/reasoning structure explains the intermediate sensitivity.
$k=1536$ is the sweet spot: every task is within 1.2pp of baseline except GSM8K (−2.1pp). This matches the L2 cache-fit prediction from Paper IX ($k^* \approx 1536$ for 36 MB L2).
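The percentage-point figures quoted in these findings follow directly from the results table. As a check, the sketch below recomputes each benchmark's change from baseline and then picks the smallest rank whose worst-case drop stays within a tolerance. The scores are copied from the table; the tolerance values (2.5pp and 1.0pp) and the function names are illustrative assumptions, not a selection rule used in this paper.

```python
# Accuracy (%) per benchmark and rank, copied from the results table.
SCORES = {
    "ARC-Challenge": {1024: 78.4, 1536: 78.5, 2048: 78.4, 4096: 78.4},
    "HellaSwag":     {1024: 79.2, 1536: 79.6, 2048: 79.7, 4096: 79.8},
    "MMLU":          {1024: 62.4, 1536: 63.7, 2048: 64.5, 4096: 64.9},
    "GSM8K":         {1024: 41.5, 1536: 47.6, 2048: 49.1, 4096: 49.7},
    "PIQA":          {1024: 78.8, 1536: 79.0, 2048: 79.2, 4096: 79.3},
}
BASELINE_RANK = 4096  # uncompressed


def deltas(scores, baseline=BASELINE_RANK):
    """Change from baseline in percentage points, rounded to 0.1pp."""
    return {
        task: {
            k: round(acc - by_rank[baseline], 1)
            for k, acc in by_rank.items()
            if k != baseline
        }
        for task, by_rank in scores.items()
    }


def smallest_rank_within(scores, tolerance_pp, baseline=BASELINE_RANK):
    """Smallest rank whose worst per-task drop is at most tolerance_pp.

    Hypothetical helper: falls back to the baseline rank if none qualifies.
    """
    d = deltas(scores, baseline)
    ranks = sorted(k for k in next(iter(scores.values())) if k != baseline)
    for k in ranks:
        if all(d[task][k] >= -tolerance_pp for task in d):
            return k
    return baseline


if __name__ == "__main__":
    for task, by_rank in deltas(SCORES).items():
        print(task, by_rank)
    print("rank within 2.5pp:", smallest_rank_within(SCORES, 2.5))
    print("rank within 1.0pp:", smallest_rank_within(SCORES, 1.0))
```

Under these assumed tolerances, a 2.5pp budget selects $k=1536$ (worst case GSM8K at −2.1pp), while a stricter 1.0pp budget requires $k=2048$.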
3. UGT Zone Hypothesis Validation
The results align with the Universal Geodesic Taxonomy (Paper XI): attention compression primarily affects Construction-based knowledge (math and code, i.e. internal logic systems) while Discovery-based knowledge (factual recall and trivia, i.e. external reality) remains robust. GSM8K (Construction × Objective) shows the largest degradation (−8.2pp at $k=1024$); ARC (Discovery × Objective) shows none. This provides empirical support for the 2×2 zone taxonomy.
References
- Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
- Stewart, W.K.O. Universal Geodesic Taxonomy. HyperTensor Paper XI, 2026.
- Gao, L. et al. lm-evaluation-harness. GitHub, 2024.