Scope
This guide reproduces the headline 106.27 percent decode-throughput result and the Nsight Compute L2 trace that refutes the cache-fit hypothesis. Everything runs from the repository root on a Windows host with an NVIDIA GPU.
Hardware target
- Reference GPU: NVIDIA RTX 4070 Laptop (Ada AD106), 8 GB GDDR6, 32 MB L2.
- Minimum: any CUDA 12.x GPU with at least 8 GB VRAM.
- Host: 16 GB RAM, NVMe storage with at least 10 GB free.
Prerequisites
- CUDA driver 552 or later.
- Zig CC toolchain (build script
build_host.ps1pins the version). - PowerShell 7+.
- The model file
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguffrom bartowski on HuggingFace. - Optional but recommended: NVIDIA Nsight Compute 2026.1 or later for the L2 trace step.
1. Build the runtime
cd <repo_root>
.\build_host.ps1
# expect: build_host\geodessical.exe
2. Set model path and verify a baseline run
$model = "C:\path\to\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
.\build_host\geodessical.exe $model -p "Hello, world" -n 32
# expect: a [GD] line reporting tok/s. Reference GPU is around 35-36 tok/s.
3. Headline 106.27 percent measurement
Apply the calibration-free attention compression at rank 1024 and compare against the same baseline. The 30-second pre-run idle is mandatory to avoid the thermal-throttling false negative reported in the paper.
Start-Sleep -Seconds 30
# baseline (8 conditions, 12 reps each)
.\scripts\benchmark_whitepaper_finalize.ps1 -CooldownSec 30
Expected ranges on the reference GPU:
- k=1024 decode ratio: 1.060 to 1.065 (paper measured 1.0627).
- k=1536 decode ratio: 0.95 to 1.02 (paper measured 0.9755).
- WikiText-2 perplexity at k=1536: about 7.69 (baseline 6.79).
4. NCU L2 trace
Reproduces table 4 of the paper. Profiles 200 representative decode launches per kernel under both configurations.
$metrics = "lts__t_sectors_op_read.sum,lts__t_sector_hit_rate.pct,dram__bytes_read.sum,sm__warps_active.avg.pct_of_peak_sustained_active"
New-Item -ItemType Directory -Force docs\figures\paper-a\ncu | Out-Null
# baseline trace
ncu --csv --metrics $metrics --target-processes all `
--launch-count 200 --launch-skip 50 --print-summary per-kernel `
.\build_host\geodessical.exe $model -p "the quick brown fox" -n 16 --temp 0 `
*> docs\figures\paper-a\ncu\baseline_raw.txt
# compressed trace
ncu --csv --metrics $metrics --target-processes all `
--launch-count 200 --launch-skip 50 --print-summary per-kernel `
.\build_host\geodessical.exe $model `
--axex-compress --axex-compress-rank 1024 --axex-skip-o `
-p "the quick brown fox" -n 16 --temp 0 `
*> docs\figures\paper-a\ncu\k1024_raw.txt
# parse to summary
python scripts\parse_ncu.py
Expected weighted L2 hit rates: 17.0 percent baseline against 16.3 percent at k=1024, a delta of 0.71 percentage points (which is the finding that refutes cache-fit).
5. Outputs to commit
docs/figures/paper-a/ncu/summary.csvdocs/figures/paper-a/ncu/summary.jsondocs/figures/paper-a/ncu/baseline_raw.txtdocs/figures/paper-a/ncu/k1024_raw.txt- The whitepaper pack under
benchmarks/whitepaper_pack_*.
Tolerances
- Throughput: plus or minus 5 percent to absorb GPU clock variance.
- Perplexity: deterministic to four decimal places.
- L2 hit rate: noise floor about 1 percentage point at the launch count above.
What can go wrong
- Skipping the 30-second pre-run idle on a laptop will produce a false 50-60 percent retention reading. Always cool down.
- Running with another GPU-bound process resident produces about 3.5 percent throughput contamination on both arms (this was the Paper B MCR contamination, but it applies to A as well).
- If
ncuemits empty CSV the most likely cause is a stale CUDA driver. Update to 552 or later.
Pre-built W_proj caches (skip calibration)
First-run calibration (~90 s on Ryzen 9 7940HS) can be skipped
by downloading a pre-computed ott_wproj_cache_*.bin from
the HyperTensor GitHub Releases page. Each release is keyed by
(base model, quantisation, rank k) and includes the SHA256, exact
base-model commit, and resource requirements. Browse them on the
Models & caches tab.
# Headline configuration (k=1536)
gh release download wproj-cache-2405A3B6 \
--repo NagusameCS/HyperTensor \
--pattern 'ott_wproj_cache_*.bin'
# Verify checksum (matches the release page)
Get-FileHash ott_wproj_cache_*.bin -Algorithm SHA256
# Place next to the GGUF and run; the runtime auto-detects.
.\build_host\geodessical.exe `
--model models\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf `
--axex-compress --axex-attn-only --axex-weight-pca `
--axex-compress-rank 1536 --tokens 64 --prompt "..."
See docs/data/release_manifest.json for the full machine-readable manifest of available caches and their resource requirements.