Scope
Validates the cross-GPU super-baseline throughput model: GRC throughput ratio T_GRC(k)/T_standard systematically exceeds 1.0 when the projection basis fits in L2 cache. The optimal k* depends on GPU L2 size --- k=1536 for RTX 4070/4080 (36-64MB L2), k=1280 for L40S (48MB L2), and k*=1024 for A100 (40MB L2). This guide runs the analytical simulator and, when GPU is available, the measured throughput sweep.
Hardware target
- Simulator: any machine (CPU-only).
- Measured sweep: any NVIDIA GPU ≥8GB VRAM with CUDA 12.x.
- Reference GPU: RTX 4070 Laptop (36MB L2) for the 106% headline.
Prerequisites
- Python 3.10+, NumPy, PyTorch.
- Geodessical binary (
build_host.ps1) for measured sweep.
Step 1: Analytical simulator
python scripts/benchmark_super_baseline.py --k-range 64-2048 --gpu "RTX 4070 Laptop"
Expected: k*=1536 with throughput ratio 1.04-1.06.
Step 2: Measured throughput sweep (requires GPU + binary)
python scripts/bench_tv_of_k.py --model ../models/qwen2.5-7b-q4_k_m.gguf \
--k-values 256,512,768,1024,1280,1536,1792 --reps 3
Expected: peak TPS at k* matching the simulator prediction within ±128.
Step 3: Cross-GPU validation table
Run on each available GPU and compare against the predictive table:
# RTX 4070 Laptop (36MB L2): k=1536, ratio≈1.04 [ok] MEASURED
# L40S (48MB L2): k=1280, ratio≈1.04 PENDING
# A100 (40MB L2): k=1024, ratio≈1.06 PENDING
# H100 (50MB L2): k=1280, ratio≈1.04 PENDING