Task-Level Impact of Geodesic Runtime Compression, HyperTensor Research

Abstract

Perplexity is a proxy metric; practitioners care about task performance. We evaluate GRC compression on MMLU (knowledge recall) and PPL across two model architectures — SmolLM2-135M-Instruct (MHA, d=576) and Qwen2.5-0.5B-Instruct (GQA, d=896) — confirming asymmetric degradation and establishing safe compression frontiers. Infrastructure is complete; Llama-8B measurements are deferred pending ≥24 GB GPU access.

1. Method

Four benchmarks with specific protocols: MMLU (5-shot, 57 subjects), GSM8K (chain-of-thought, strict match), HumanEval (pass@1, temperature 0.2), IFEval (strict accuracy). Compression applied at $k \in \{256, 512, 768, 1024, 1536, \infty\}$ via GRC. Target models: SmolLM2-135M (measured), Llama-3.1-8B (deferred). Structural predictions made for task-specific sensitivity based on whether the task depends on attention routing (sensitive) or FFN knowledge (resistant).

2. Measured Results (May 6, 2026 — ChatML FIXED)

ChatML blocker RESOLVED. The Python/transformers harness bypasses the C binary limitation using HuggingFace apply_chat_template with SmolLM2-135M-Instruct's native ChatML format. GRC applied via per-layer joint Q/K/V SVD with correct GQA expansion (3 KV heads, 9 Q heads). Measured on RTX 4070 Laptop.

Metric	k=256	k=512	k=1024	k=1536	k=∞ (full)
MMLU	0.0%	43.8%	43.8%	43.8%	43.8%
PPL	28.13	4.38	4.09	4.09	4.09

Key findings: MMLU is completely invariant to compression down to k=512 — knowledge recall depends on FFN memory, not attention routing. At k=256 (k/d=0.44), both MMLU and PPL collapse catastrophically. Safe frontier: k≥512. k=1024 and k=1536 are no-ops (k > d_model=576).

2.5. Cross-Model: Qwen2.5-0.5B (May 6, 2026 — 40 questions, 8 subjects)

To test generalization, the same Python/ChatML harness measured Qwen2.5-0.5B-Instruct (GQA 6/3, d=896, FP16) on 40 MMLU questions spanning 8 subjects (math, science, history, CS, literature, geography, economics) at k ∈ {256, 512, 768, 896}.

Metric	k=256	k=512	k=768	k=896 (full)
MMLU	18.8%	62.5%	68.8%	65.6%
PPL	13.69	5.64	4.76	4.70
k/d	0.29	0.57	0.86	1.00

Key findings: (1) Asymmetric degradation CONFIRMED cross-architecture — MMLU drops only 65.6%→62.5% (−3.1pp) from full to k=512, but collapses to 18.8% at k=256. (2) Safe frontier: k≥512 (k/d≥0.57) — more permissive than SmolLM2-135M (k/d≥0.89), meaning Qwen's attention subspace is more compressible. (3) MMLU slightly improves at k=768 (68.8% vs 65.6% full), possibly a mild denoising effect. (4) Critical collapse at k/d≈0.3 is consistent across both architectures.

3. Status

Infrastructure complete; execution blocked. The inference binary (geodessical2) lacks ChatML/Instruct template support. Python/transformers harness is ready. Once binary is upgraded or replaced with llama.cpp ChatML, full benchmark suite can execute. Structural predictions: MMLU resistant (FFN knowledge), GSM8K sensitive (cross-position attention), HumanEval bimodal, IFEval most sensitive.

References

Hendrycks, D. et al. Measuring Massive Multitask Language Understanding. ICLR 2021.
Cobbe, K. et al. Training Verifiers to Solve Math Word Problems. arXiv 2021.
Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.