Paper VI · May 2026 · v1.0

Task-Level Impact of GRC Compression

GRC is faster — but does it still work? Full 5-benchmark sweep measuring what GRC preserves and what it costs.

By William Ken Ohara Stewart (NagusameCS)

Abstract

Paper I demonstrated that GRC attention compression at $k=1024$ reaches 106.27% of baseline throughput on Llama-3.1-8B. This paper asks what that speedup costs in task quality, measuring GRC's impact across five standard benchmarks: ARC-Challenge (factual reasoning), HellaSwag (commonsense), MMLU (multidisciplinary knowledge), GSM8K (math), and PIQA (physical reasoning). At $k=1024$, factual and commonsense tasks stay near baseline (ARC-Challenge: 78.4% unchanged; HellaSwag: 79.8→79.2%; PIQA: 79.3→78.8%), MMLU degrades moderately (64.9→62.4%), and math reasoning degrades the most (GSM8K: 49.7→41.5%). At $k=1536$, every task is within 1.2pp of baseline except GSM8K (−2.1pp). The results validate the UGT zone-specialisation hypothesis: attention compression primarily affects construction-based reasoning (math, code) while discovery-based factual knowledge remains robust.

1. Experimental Setup

Model: Llama-3.1-8B-Instruct Q4_K_M. Compression: GRC attention-only (Paper I protocol). Ranks tested: $k \in \{1024, 1536, 2048, 4096\}$ (4096 = uncompressed baseline). Benchmarks: ARC-Challenge (25-shot), HellaSwag (10-shot), MMLU (5-shot), GSM8K (5-shot), PIQA (0-shot). Evaluation: lm-evaluation-harness v0.4.4.
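The setup above can be written down as a small driver that assembles one lm-evaluation-harness command per benchmark. This is a minimal sketch, not the paper's actual evaluation script: the task identifiers and the `--model`, `--model_args`, `--tasks`, `--num_fewshot`, and `--output_path` flags are those of the harness CLI, while `build_commands`, the `results/` directory, and the `PATH_TO_MODEL` placeholder are hypothetical. The paper does not state how the Q4_K_M checkpoint or the GRC-compressed variants are loaded, so the model arguments are left to the caller.

```python
# Few-shot counts as listed in the experimental setup; keys are the
# harness's registered task identifiers.
FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "gsm8k": 5,
    "piqa": 0,
}


def build_commands(model_args: str, model_type: str = "hf") -> list[list[str]]:
    """Return one lm_eval argv list per benchmark. Nothing is executed."""
    commands = []
    for task, shots in FEWSHOT.items():
        commands.append([
            "lm_eval",
            "--model", model_type,
            "--model_args", model_args,
            "--tasks", task,
            "--num_fewshot", str(shots),
            "--output_path", f"results/{task}",  # hypothetical output layout
        ])
    return commands


if __name__ == "__main__":
    # PATH_TO_MODEL is a placeholder, not a path taken from the paper.
    for cmd in build_commands("pretrained=PATH_TO_MODEL"):
        print(" ".join(cmd))
```

Running one command per task keeps the few-shot count explicit for each benchmark; the same list would be generated once per rank $k$ with the corresponding compressed model.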

2. Results: Full Benchmark Sweep

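| Benchmark | $k=1024$ | $k=1536$ | $k=2048$ | Baseline ($d=4096$) |
|---|---|---|---|---|
| ARC-Challenge | 78.4% | 78.5% | 78.4% | 78.4% |
| HellaSwag | 79.2% | 79.6% | 79.7% | 79.8% |
| MMLU | 62.4% | 63.7% | 64.5% | 64.9% |
| GSM8K | 41.5% | 47.6% | 49.1% | 49.7% |
| PIQA | 78.8% | 79.0% | 79.2% | 79.3% |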
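The percentage-point differences discussed in this section follow directly from the table. The sketch below recomputes them; the accuracy values are copied from the table, while the names `SCORES`, `BASELINE_RANK`, and `deltas_pp` are illustrative and not part of the paper's code.

```python
# Accuracy (%) per benchmark and rank, copied from the results table.
SCORES = {
    "ARC-Challenge": {1024: 78.4, 1536: 78.5, 2048: 78.4, 4096: 78.4},
    "HellaSwag":     {1024: 79.2, 1536: 79.6, 2048: 79.7, 4096: 79.8},
    "MMLU":          {1024: 62.4, 1536: 63.7, 2048: 64.5, 4096: 64.9},
    "GSM8K":         {1024: 41.5, 1536: 47.6, 2048: 49.1, 4096: 49.7},
    "PIQA":          {1024: 78.8, 1536: 79.0, 2048: 79.2, 4096: 79.3},
}
BASELINE_RANK = 4096  # uncompressed baseline


def deltas_pp(scores, baseline_rank=BASELINE_RANK):
    """Return score minus baseline, in percentage points, per task and rank."""
    out = {}
    for task, by_rank in scores.items():
        base = by_rank[baseline_rank]
        out[task] = {
            rank: round(value - base, 1)  # round away float noise
            for rank, value in by_rank.items()
            if rank != baseline_rank
        }
    return out


if __name__ == "__main__":
    for task, row in deltas_pp(SCORES).items():
        cells = "  ".join(f"k={rank}: {delta:+.1f}pp" for rank, delta in row.items())
        print(f"{task:<14} {cells}")
```

Negative values are drops relative to the uncompressed baseline; a positive value, such as ARC-Challenge at $k=1536$, is a change of 0.1pp in the other direction.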

Key Findings

Factual recall is robust: ARC-Challenge shows zero degradation at any compression level. HellaSwag and PIQA lose only 0.5–0.6pp at $k=1024$. These tasks rely on stored factual/commonsense knowledge that attention compression preserves.

Math reasoning degrades: GSM8K drops 8.2pp at $k=1024$, recovering to within 2.1pp at $k=1536$. Math requires multi-step reasoning chains where attention quality matters more.

MMLU is intermediate: a 2.5pp drop at $k=1024$, narrowing to 1.2pp at $k=1536$. MMLU's mixed fact/reasoning structure explains the intermediate sensitivity.

$k=1536$ is the sweet spot: ARC-Challenge, HellaSwag, and PIQA are within 0.3pp of baseline, MMLU is 1.2pp below, and only GSM8K shows a larger gap (−2.1pp). This matches the L2 cache-fit prediction from Paper IX ($k^* \approx 1536$ for 36 MB L2).
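One way to make the "sweet spot" choice explicit is to pick the smallest rank whose worst-case drop across the five benchmarks stays within a stated tolerance. The sketch below does this on the table values. The tolerance is a hypothetical parameter introduced here for illustration; the paper does not define a numerical threshold, and the function names are not from the paper's code.

```python
RANKS = (1024, 1536, 2048, 4096)  # 4096 is the uncompressed baseline

# Accuracy (%) per benchmark, in the same order as RANKS (from the table).
ACCURACY = {
    "ARC-Challenge": (78.4, 78.5, 78.4, 78.4),
    "HellaSwag":     (79.2, 79.6, 79.7, 79.8),
    "MMLU":          (62.4, 63.7, 64.5, 64.9),
    "GSM8K":         (41.5, 47.6, 49.1, 49.7),
    "PIQA":          (78.8, 79.0, 79.2, 79.3),
}


def worst_drop_pp(rank_index: int) -> float:
    """Largest drop below baseline (pp) over all benchmarks at one rank."""
    return max(row[-1] - row[rank_index] for row in ACCURACY.values())


def smallest_rank_within(tolerance_pp: float) -> int:
    """Smallest rank whose worst-case drop is at most tolerance_pp.

    tolerance_pp is a hypothetical knob, not a threshold from the paper.
    """
    for i, rank in enumerate(RANKS):
        if worst_drop_pp(i) <= tolerance_pp:
            return rank
    return RANKS[-1]  # fall back to the uncompressed baseline


if __name__ == "__main__":
    for tol in (2.5, 1.0):
        print(f"tolerance {tol}pp -> k={smallest_rank_within(tol)}")
```

Under an assumed 2.5pp tolerance this selects $k=1536$, since GSM8K's 2.1pp drop is the worst case there; tightening the tolerance to 1.0pp moves the choice to $k=2048$. The selected rank therefore depends on how much GSM8K degradation an application can accept.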

3. UGT Zone Hypothesis Validation

The results align with the Universal Geodesic Taxonomy (Paper XI): attention compression primarily affects Construction-based knowledge (math, code — internal logic systems) while Discovery-based knowledge (factual recall, trivia — external reality) remains robust. GSM8K (Construction × Objective) shows the largest degradation; ARC (Discovery × Objective) shows none. This provides empirical support for the 2×2 zone taxonomy.

References

  1. Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
  2. Stewart, W.K.O. Universal Geodesic Taxonomy. HyperTensor Paper XI, 2026.
  3. Gao, L. et al. lm-evaluation-harness. GitHub, 2024.