Reproduce Paper I: Cross-GPU Super-Baseline, HyperTensor Research

Scope

Validates the cross-GPU super-baseline throughput model: the GRC throughput ratio T_GRC(k)/T_standard systematically exceeds 1.0 when the projection basis fits in L2 cache. The optimal k, denoted k*, depends on GPU L2 size: k*=1536 for RTX 4070/4080 (36-64MB L2), k*=1280 for L40S (48MB L2), and k*=1024 for A100 (40MB L2). This guide runs the analytical simulator and, when a GPU is available, the measured throughput sweep.
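
As a minimal sketch of the quantity being validated, the ratio can be computed from two tokens-per-second figures. The function names and the sample numbers are illustrative assumptions, not output from the repository's scripts.

```python
def throughput_ratio(tps_grc: float, tps_standard: float) -> float:
    """Return T_GRC(k) / T_standard for one value of k."""
    if tps_standard <= 0:
        raise ValueError("tps_standard must be positive")
    return tps_grc / tps_standard


def is_super_baseline(ratio: float) -> bool:
    """A configuration is super-baseline when the ratio exceeds 1.0."""
    return ratio > 1.0


# Placeholder numbers for illustration only (not measured results).
ratio = throughput_ratio(tps_grc=53.0, tps_standard=50.0)
print(f"ratio={ratio:.2f} super_baseline={is_super_baseline(ratio)}")
```

A ratio of exactly 1.0 means parity with the standard path, so the check is strict.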

Hardware target

Simulator: any machine (CPU-only).
Measured sweep: any NVIDIA GPU with ≥8GB VRAM and CUDA 12.x.
Reference GPU: RTX 4070 Laptop (36MB L2), the source of the 106% headline figure.

Prerequisites

Python 3.10+, NumPy, PyTorch.
Geodessical binary (built with build_host.ps1) for the measured sweep.

Step 1: Analytical simulator

python scripts/benchmark_super_baseline.py --k-range 64-2048 --gpu "RTX 4070 Laptop"

Expected: k*=1536 with throughput ratio 1.04-1.06.

Step 2: Measured throughput sweep (requires GPU + binary)

python scripts/bench_tv_of_k.py --model ../models/qwen2.5-7b-q4_k_m.gguf \
    --k-values 256,512,768,1024,1280,1536,1792 --reps 3

Expected: TPS peaks at a k within ±128 of the k* predicted by the simulator.
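
The acceptance check for this step can be sketched as follows. This assumes the sweep results have already been collected into a dictionary mapping each k to its per-repetition TPS values; the actual output format of bench_tv_of_k.py is not described here, and the helper names and numbers are hypothetical.

```python
from statistics import mean


def peak_k(tps_by_k: dict[int, list[float]]) -> int:
    """Return the k whose mean TPS over repetitions is highest."""
    if not tps_by_k:
        raise ValueError("no sweep results")
    return max(tps_by_k, key=lambda k: mean(tps_by_k[k]))


def matches_prediction(measured_k: int, predicted_k: int, tol: int = 128) -> bool:
    """True when the measured peak lies within +/- tol of the predicted k*."""
    return abs(measured_k - predicted_k) <= tol


# Placeholder TPS values for illustration only (3 reps per k, not real data).
sweep = {
    1280: [40.1, 40.3, 40.2],
    1536: [41.0, 41.2, 40.9],
    1792: [39.8, 39.9, 40.0],
}
k_meas = peak_k(sweep)
print(k_meas, matches_prediction(k_meas, predicted_k=1536))
```

Averaging over the repetitions before taking the maximum keeps a single noisy run from deciding the peak.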

Step 3: Cross-GPU validation table

Run Steps 1 and 2 on each available GPU and compare the results against the predictive table:

GPU                L2     k*     Ratio    Status
RTX 4070 Laptop    36MB   1536   ≈1.04    MEASURED
L40S               48MB   1280   ≈1.04    PENDING
A100               40MB   1024   ≈1.06    PENDING
H100               50MB   1280   ≈1.04    PENDING
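
One way to automate the comparison is to hold the predicted rows as data and test each measured result against them. The sketch below copies its values from the table above. The `compare` helper is hypothetical, the ratio tolerance of 0.02 is an assumption rather than a threshold stated in this guide, and the k tolerance reuses the ±128 from Step 2.

```python
# Predicted k* and throughput ratio per GPU, copied from the table above.
PREDICTED = {
    "RTX 4070 Laptop": {"l2_mb": 36, "k_star": 1536, "ratio": 1.04},
    "L40S": {"l2_mb": 48, "k_star": 1280, "ratio": 1.04},
    "A100": {"l2_mb": 40, "k_star": 1024, "ratio": 1.06},
    "H100": {"l2_mb": 50, "k_star": 1280, "ratio": 1.04},
}


def compare(gpu: str, measured_k: int, measured_ratio: float,
            k_tol: int = 128, ratio_tol: float = 0.02) -> dict:
    """Compare one measured result with the predicted row for that GPU.

    ratio_tol is an assumed tolerance chosen for illustration.
    """
    row = PREDICTED[gpu]
    return {
        "gpu": gpu,
        "k_ok": abs(measured_k - row["k_star"]) <= k_tol,
        "ratio_ok": abs(measured_ratio - row["ratio"]) <= ratio_tol,
    }


# Example using the one row marked MEASURED in the table.
result = compare("RTX 4070 Laptop", measured_k=1536, measured_ratio=1.04)
print(result)
```

Rows still marked PENDING can be checked the same way once measurements exist for those GPUs.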