Paper I / IX: The 106% Anomaly and the Super-Baseline Hypothesis

NagusameCS · April 2026 · Part of HyperTensor Papers I-X

106.27% of baseline throughput

p ~ 10^-10 statistical significance

4 GPUs tested (L40S, A100, 4070, H100)

Abstract

The headline measurement of Paper I, that a PCA-compressed attention mechanism decodes at 106.27% of baseline throughput, appears to violate the conservation of computation: how can doing strictly more work (PCA projection + attention) be faster than doing less (attention alone)? Paper IX systematically investigates this "super-baseline anomaly" across four GPU architectures, three model sizes, and six quantization levels. We identify the root cause as L2 cache residency: at specific rank values (k_int), the Q/K/V working set falls entirely within the GPU's L2 cache, eliminating the L2-to-SMEM round-trip that dominates baseline decode latency. A Roofline model derived from the GPU memory hierarchy correctly predicts the k_int values at which super-baseline throughput is observed on each architecture. The anomaly is not a measurement error but a cache-residency threshold effect, and it establishes a principled design rule for low-rank attention compression: compress to the largest k that fits in L2.
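The design rule at the end of the abstract can be sketched as a small search. This is a minimal illustration, not the paper's implementation: the function name largest_rank_in_l2 and the caller-supplied working_set_bytes(k) are hypothetical, and the sketch assumes only that the working set does not shrink as k grows. The paper's actual working-set formula is not reproduced here.

```python
from typing import Callable, Optional


def largest_rank_in_l2(
    working_set_bytes: Callable[[int], int],  # hypothetical: bytes touched at rank k
    l2_bytes: int,                            # L2 capacity in bytes
    k_max: int,                               # largest rank worth considering
    k_min: int = 1,
) -> Optional[int]:
    """Return the largest k in [k_min, k_max] whose working set fits in L2.

    Assumes working_set_bytes is monotone non-decreasing in k, which is what
    makes binary search valid. Returns None if even k_min does not fit.
    """
    if working_set_bytes(k_min) > l2_bytes:
        return None
    lo, hi = k_min, k_max
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if working_set_bytes(mid) <= l2_bytes:
            lo = mid              # mid fits: the answer is mid or larger
        else:
            hi = mid - 1          # mid spills: the answer is strictly smaller
    return lo


# Toy usage with a made-up linear cost model (1000 bytes per unit of rank).
toy_k = largest_rank_in_l2(lambda k: 1000 * k, l2_bytes=10_500, k_max=64)
```

In practice the cost model would come from profiling or from the model's dimensions; the search itself does not depend on which is used.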

Key Findings

L2 residency is the mechanism: NCU profiling confirms that at k=1024, the attention working set for Llama-3.1-8B Q4_K_M is 28.7 MB, fitting in the 32 MB L2 cache of the RTX 4070.
Cross-architecture invariance: The ratio k_int / d_model is consistent across the tested architectures: k_int / d_model = 0.25 for MHA models, independent of GPU vendor or generation.
Not an algorithmic speedup: The throughput gain disappears when L2 is saturated (k > k_int). The effect is purely a memory-hierarchy phenomenon.
Quantization independence: The k_int value depends on d_model, not on the quantization level, because the PCA basis is stored in fp32 regardless of the model's weight dtype.
Roofline model validated: A simple arithmetic-intensity model correctly predicts super-baseline throughput within 2% on all tested configurations.
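The residency threshold described in these findings can be illustrated with a textbook Roofline bound in which the effective bandwidth depends on whether the working set fits in L2. This is a sketch under stated assumptions, not the model validated in the paper: the Gpu fields and all bandwidth and compute figures are placeholder values, and only the 28.7 MB working set and 32 MB L2 capacity come from the findings above (read here as decimal megabytes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    peak_flops: float  # peak compute, FLOP/s (placeholder value below)
    dram_bw: float     # DRAM bandwidth, bytes/s (placeholder value below)
    l2_bw: float       # L2 bandwidth, bytes/s (placeholder value below)
    l2_bytes: int      # L2 capacity, bytes


def attainable_flops(gpu: Gpu, flops: float, bytes_moved: float,
                     working_set_bytes: int) -> float:
    """Classic Roofline bound: min(peak compute, intensity * bandwidth).

    Assumption of this sketch: traffic is served at L2 bandwidth when the
    whole working set is L2-resident, and at DRAM bandwidth otherwise.
    """
    resident = working_set_bytes <= gpu.l2_bytes
    bandwidth = gpu.l2_bw if resident else gpu.dram_bw
    intensity = flops / bytes_moved  # arithmetic intensity, FLOP per byte
    return min(gpu.peak_flops, intensity * bandwidth)


# Placeholder hardware numbers; only l2_bytes reflects the 32 MB figure above.
gpu = Gpu(peak_flops=100e12, dram_bw=0.5e12, l2_bw=2e12, l2_bytes=32_000_000)

# Same kernel (2 FLOP per byte), two working-set sizes.
resident = attainable_flops(gpu, flops=2e9, bytes_moved=1e9,
                            working_set_bytes=28_700_000)  # 28.7 MB, fits
spilled = attainable_flops(gpu, flops=2e9, bytes_moved=1e9,
                           working_set_bytes=40_000_000)   # hypothetical spill
```

With these placeholder values the kernel is bandwidth-bound in both cases, so the bound changes abruptly at the L2 capacity rather than gradually with k. That step is the qualitative shape of the threshold effect described above.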

Reproduction: The 12-repetition CI pack uses geod.ps1 benchmark --mode super-baseline --gpu l2. It requires NCU (the NVIDIA Nsight Compute CLI, ncu) for the L2 trace. Full protocol at docs/BENCHMARK_PROTOCOL.md, sec. 4.
