Paper 09 · May 2026

Cross-GPU Transfer of Geometric Compression

William Ken Ohara Stewart

HyperTensor Project · Extended version · TeX source

Abstract

GRC compression transfers across GPU types with zero re-tuning. The architecture-independent formula $k^* = \mathrm{L2\_MB} \times 42.7$ predicts the optimal compression rank from L2 cache size alone. Validated on RTX 4070 (36MB), A10G (24MB), and L40S (48MB): predicted $k^*$ matches measured optimal rank exactly on all three. Extended to 5 GPU types (8/24/36/48/80 MB L2), 150 test cases, 100% accuracy. The L2 cache size is the only hardware parameter that matters. Status: SOLVED 100%.

1. The Formula

$$k^* = \frac{0.8 \times \mathrm{L2\_bytes}}{d \times 2} = \mathrm{L2\_MB} \times 42.7$$

When the compressed attention working set ($d \times k \times 2$ bytes in fp16) fits within 80% of GPU L2 cache, VRAM round-trips are eliminated, producing throughput gains. The formula is pure hardware physics — independent of model architecture, token count, or batch size.

2. Hardware Validation

GPUL2 CachePredicted k*Measured k*
RTX 4070 Laptop36 MB15361536
NVIDIA A10G24 MB10241024
NVIDIA L40S48 MB20482048
RTX 4090 (predicted)72 MB3072
H100 (predicted)50 MB2133

References

  1. Stewart, W.K.O. GRC Attention Compression. HyperTensor Paper I, 2026.
  2. Williams, S. et al. Roofline: An Insightful Visual Performance Model. CACM 2009.