Paper I / IX: The 106% Anomaly and the Super-Baseline Hypothesis

NagusameCS · April 2026 · Part of HyperTensor Papers I-X
106.27% of baseline throughput
p ~ 10^-10 statistical significance
4 GPUs tested (L40S, A100, 4070, H100)

Abstract

The headline measurement of Paper I, that a PCA-compressed attention mechanism decodes at 106.27% of baseline throughput, appears to violate the conservation of computation: how can doing strictly more work (PCA projection + attention) be faster than doing less (attention alone)? Paper IX systematically investigates this "super-baseline anomaly" across four GPU architectures, three model sizes, and six quantization levels. We identify the root cause as L2 cache residency: at specific rank values (k_int), the Q/K/V working set falls entirely within the GPU's L2 cache, eliminating the L2-to-SMEM round-trip that dominates baseline decode latency. A Roofline model derived from the GPU memory hierarchy correctly predicts the k_int values at which super-baseline throughput is observed on each architecture. The anomaly is not a measurement error but a cache-residency threshold effect, and it establishes a principled design rule for low-rank attention compression: compress to the largest k that fits in L2.
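The design rule in the abstract ("compress to the largest k that fits in L2") can be illustrated with a minimal sketch. The working-set formula below is an assumption made for illustration, not the model used in the paper: it counts only the rank-k projected K and V tensors of a single layer and ignores projection matrices, activations, and any other L2 occupants. The function names, the usable_fraction parameter, and the 40 MiB example capacity are likewise hypothetical.

```python
def working_set_bytes(k, seq_len, n_kv_heads, bytes_per_elem):
    """Assumed working set: rank-k K and V (2 tensors) for one layer."""
    return 2 * seq_len * n_kv_heads * k * bytes_per_elem


def largest_k_in_l2(l2_bytes, seq_len, n_kv_heads, bytes_per_elem,
                    k_max, usable_fraction=1.0):
    """Largest rank k (capped at k_max) whose assumed working set fits in L2.

    usable_fraction is a hypothetical knob for the share of L2 that the
    attention working set can realistically occupy.
    """
    budget = l2_bytes * usable_fraction
    # The assumed working set is linear in k, so divide by the cost of one rank.
    per_rank = working_set_bytes(1, seq_len, n_kv_heads, bytes_per_elem)
    return min(int(budget // per_rank), k_max)


if __name__ == "__main__":
    # Illustrative numbers only: 40 MiB of L2, 32768-token context,
    # 8 KV heads, 2-byte (fp16) elements, head dimension 128 as the cap.
    l2 = 40 * 2**20
    k = largest_k_in_l2(l2, seq_len=32768, n_kv_heads=8,
                        bytes_per_elem=2, k_max=128)
    print("largest k under the assumed model:", k)
```

Under this assumed model the threshold scales inversely with context length and element width, so the same GPU would cross the residency boundary at a different k for each quantization level. The actual k_int values reported in the paper come from its Roofline model and measurements, not from this sketch.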

Key Findings

Reproduction: The 12-repetition CI pack runs geod.ps1 benchmark --mode super-baseline --gpu l2. The L2 trace requires NCU (NVIDIA Nsight Compute, the ncu command-line profiler). The full protocol is in docs/BENCHMARK_PROTOCOL.md, sec. 4.