Task-Level Impact of GRC Compression, HyperTensor Paper VI

Abstract

Paper I demonstrated that GRC attention compression at $k=1024$ achieves 106.27% of baseline throughput on Llama-3.1-8B. This paper measures the cost in task quality across five standard benchmarks: ARC-Challenge (factual reasoning), HellaSwag (commonsense), MMLU (multidisciplinary knowledge), GSM8K (math), and PIQA (physical reasoning). At $k=1024$, factual recall is near baseline (ARC: 78.4% unchanged; HellaSwag: 79.8→79.2%), MMLU degrades moderately (64.9→62.4%), and math reasoning degrades the most (GSM8K: 49.7→41.5%). At $k=1536$, every task is within 1.2pp of baseline except GSM8K (−2.1pp). The results support the UGT zone-specialisation hypothesis: attention compression primarily affects construction-based reasoning (math, code) while discovery-based factual knowledge remains robust.

1. Experimental Setup

Model: Llama-3.1-8B-Instruct Q4_K_M. Compression: GRC attention-only (Paper I protocol). Ranks tested: $k \in \{1024, 1536, 2048, 4096\}$ (4096 = uncompressed baseline). Benchmarks: ARC-Challenge (25-shot), HellaSwag (10-shot), MMLU (5-shot), GSM8K (5-shot), PIQA (0-shot). Evaluation: lm-evaluation-harness v0.4.4.
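
The setup above can be written down as a small task-to-shots configuration. The following is a minimal sketch, not the harness configuration actually used for this paper: it only assembles `lm_eval` command lines as strings. The task identifiers and the `--model`, `--model_args`, `--tasks`, and `--num_fewshot` flags are those of lm-evaluation-harness v0.4.x; the `hf` backend and the checkpoint path are assumptions made for illustration, since the text does not say how the GRC-compressed Q4_K_M model is loaded.

```python
import shlex

# Few-shot settings as listed in the setup above, keyed by the
# lm-evaluation-harness task identifier for each benchmark.
FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "gsm8k": 5,
    "piqa": 0,
}


def build_command(task: str, model_args: str, backend: str = "hf") -> str:
    """Return one lm_eval command line for a single benchmark.

    `backend` defaults to "hf" purely as an assumption; the paper does
    not state which harness backend was used.
    """
    argv = [
        "lm_eval",
        "--model", backend,
        "--model_args", model_args,
        "--tasks", task,
        "--num_fewshot", str(FEWSHOT[task]),
    ]
    return shlex.join(argv)


# Hypothetical checkpoint path for one compressed rank (placeholder only).
MODEL_ARGS = "pretrained=/path/to/grc-k1536-checkpoint"

commands = [build_command(task, MODEL_ARGS) for task in FEWSHOT]
for command in commands:
    print(command)
```

Running one command per task keeps each benchmark at its own shot count, since `--num_fewshot` applies to every task named in a single invocation.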

2. Results: Full Benchmark Sweep

Benchmark	$k=1024$	$k=1536$	$k=2048$	Baseline ($d=4096$)
ARC-Challenge	78.4%	78.5%	78.4%	78.4%
HellaSwag	79.2%	79.6%	79.7%	79.8%
MMLU	62.4%	63.7%	64.5%	64.9%
GSM8K	41.5%	47.6%	49.1%	49.7%
PIQA	78.8%	79.0%	79.2%	79.3%
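
The percentage-point differences quoted in the findings follow directly from the table. A short sketch of that arithmetic is given here for readers who want to recompute them; the variable and function names are illustrative, and the numbers are copied from the table as-is.

```python
# Accuracy (%) copied from the table above; rank 4096 is the
# uncompressed baseline column.
RESULTS = {
    "ARC-Challenge": {1024: 78.4, 1536: 78.5, 2048: 78.4, 4096: 78.4},
    "HellaSwag":     {1024: 79.2, 1536: 79.6, 2048: 79.7, 4096: 79.8},
    "MMLU":          {1024: 62.4, 1536: 63.7, 2048: 64.5, 4096: 64.9},
    "GSM8K":         {1024: 41.5, 1536: 47.6, 2048: 49.1, 4096: 49.7},
    "PIQA":          {1024: 78.8, 1536: 79.0, 2048: 79.2, 4096: 79.3},
}
BASELINE_RANK = 4096


def deltas_pp(results, baseline=BASELINE_RANK):
    """Return accuracy change versus baseline, in percentage points.

    Negative values mean the compressed model scores below baseline.
    Values are rounded to one decimal to match the table's precision.
    """
    out = {}
    for task, by_rank in results.items():
        base = by_rank[baseline]
        out[task] = {
            rank: round(acc - base, 1)
            for rank, acc in by_rank.items()
            if rank != baseline
        }
    return out


DELTAS = deltas_pp(RESULTS)
for task, row in DELTAS.items():
    print(f"{task:14s}", row)
```

Rounding to one decimal matters only because the inputs are themselves reported to one decimal; the raw floating-point differences carry small representation errors.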

Key Findings

Factual recall is robust: ARC-Challenge shows zero degradation at any compression level. HellaSwag and PIQA lose only 0.5–0.6pp at $k=1024$. These tasks rely on stored factual/commonsense knowledge that attention compression preserves.

Math reasoning degrades: GSM8K drops 8.2pp at $k=1024$, recovering to within 2.1pp at $k=1536$. Math requires multi-step reasoning chains where attention quality matters more.

MMLU is intermediate: 2.5pp drop at $k=1024$, narrowing to 1.2pp at $k=1536$. MMLU's mixed fact/reasoning structure explains the intermediate sensitivity.

$k=1536$ is the sweet spot: within 1.2pp of baseline on every task except GSM8K (−2.1pp). This matches the L2 cache-fit prediction from Paper IX ($k^* \approx 1536$ for 36 MB L2).
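
One way to make the sweet-spot choice explicit is to pick the smallest rank whose worst-case drop across the five benchmarks stays under a tolerance. The sketch below does this on accuracy alone; the tolerance values are assumptions chosen for illustration, and the selection rule is not claimed to be the procedure used in this paper or the cache-fit model of Paper IX.

```python
# Accuracy (%) copied from the results table; rank 4096 is the baseline.
RESULTS = {
    "ARC-Challenge": {1024: 78.4, 1536: 78.5, 2048: 78.4, 4096: 78.4},
    "HellaSwag":     {1024: 79.2, 1536: 79.6, 2048: 79.7, 4096: 79.8},
    "MMLU":          {1024: 62.4, 1536: 63.7, 2048: 64.5, 4096: 64.9},
    "GSM8K":         {1024: 41.5, 1536: 47.6, 2048: 49.1, 4096: 49.7},
    "PIQA":          {1024: 78.8, 1536: 79.0, 2048: 79.2, 4096: 79.3},
}


def worst_drop_pp(results, rank, baseline=4096):
    """Largest accuracy loss (pp) over all tasks at the given rank."""
    return max(by_rank[baseline] - by_rank[rank] for by_rank in results.values())


def smallest_rank_within(results, tolerance_pp, baseline=4096):
    """Smallest rank whose worst-case drop is at most tolerance_pp.

    Falls back to the baseline rank if no compressed rank qualifies.
    A tiny epsilon absorbs floating-point error in the subtraction.
    """
    ranks = sorted(r for r in next(iter(results.values())) if r != baseline)
    for rank in ranks:
        if worst_drop_pp(results, rank, baseline) <= tolerance_pp + 1e-9:
            return rank
    return baseline


# Illustrative tolerances (assumptions, not values from the paper).
for tol in (2.5, 1.0, 0.1):
    print(f"tolerance {tol}pp -> k = {smallest_rank_within(RESULTS, tol)}")
```

Under this rule the choice is driven entirely by GSM8K, the most sensitive task: a 2.5pp budget admits $k=1536$, while a 1.0pp budget already requires $k=2048$.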

3. UGT Zone Hypothesis Validation

The results align with the Universal Geodesic Taxonomy (Paper XI): attention compression primarily affects Construction-based knowledge (math and code, i.e. internal logic systems) while Discovery-based knowledge (factual recall and trivia, i.e. external reality) remains robust. GSM8K (Construction × Objective) shows the largest degradation; ARC (Discovery × Objective) shows none. Code was not among the five benchmarks, so the Construction-side evidence here comes from math. Within that scope, the results provide empirical support for the 2×2 zone taxonomy.

References

Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
Stewart, W.K.O. Universal Geodesic Taxonomy. HyperTensor Paper XI, 2026.
Gao, L. et al. lm-evaluation-harness. GitHub, 2024.