Reproduce Paper A: GRC attention compression, HyperTensor Research

Scope

This guide reproduces the headline 106.27 percent decode-throughput result and the Nsight Compute L2 trace that refutes the cache-fit hypothesis. Everything runs from the repository root on a Windows host with an NVIDIA GPU.

Hardware target

Reference GPU: NVIDIA RTX 4070 Laptop (Ada AD106), 8 GB GDDR6, 32 MB L2.
Minimum: any CUDA 12.x GPU with at least 8 GB VRAM.
Host: 16 GB RAM, NVMe storage with at least 10 GB free.

Prerequisites

CUDA driver 552 or later.
Zig CC toolchain (build script build_host.ps1 pins the version).
PowerShell 7+.
The model file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf from bartowski on HuggingFace.
Optional but recommended: NVIDIA Nsight Compute 2026.1 or later for the L2 trace step.

1. Build the runtime

cd <repo_root>
.\build_host.ps1
# expect: build_host\geodessical.exe

2. Set model path and verify a baseline run

$model = "C:\path\to\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
.\build_host\geodessical.exe $model -p "Hello, world" -n 32
# expect: a [GD] line reporting tok/s. Reference GPU is around 35-36 tok/s.

3. Headline 106.27 percent measurement

Apply the calibration-free attention compression at rank 1024 and compare against the same baseline. The 30-second pre-run idle is mandatory to avoid the thermal-throttling false negative reported in the paper.

Start-Sleep -Seconds 30
# baseline (8 conditions, 12 reps each)
.\scripts\benchmark_whitepaper_finalize.ps1 -CooldownSec 30

Expected ranges on the reference GPU:

k=1024 decode ratio: 1.060 to 1.065 (paper measured 1.0627).
k=1536 decode ratio: 0.95 to 1.02 (paper measured 0.9755).
WikiText-2 perplexity at k=1536: about 7.69 (baseline 6.79).

4. NCU L2 trace

Reproduces table 4 of the paper. Profiles 200 representative decode launches per kernel under both configurations.

$metrics = "lts__t_sectors_op_read.sum,lts__t_sector_hit_rate.pct,dram__bytes_read.sum,sm__warps_active.avg.pct_of_peak_sustained_active"
New-Item -ItemType Directory -Force docs\figures\paper-a\ncu | Out-Null

# baseline trace
ncu --csv --metrics $metrics --target-processes all `
    --launch-count 200 --launch-skip 50 --print-summary per-kernel `
    .\build_host\geodessical.exe $model -p "the quick brown fox" -n 16 --temp 0 `
    *> docs\figures\paper-a\ncu\baseline_raw.txt

# compressed trace
ncu --csv --metrics $metrics --target-processes all `
    --launch-count 200 --launch-skip 50 --print-summary per-kernel `
    .\build_host\geodessical.exe $model `
    --axex-compress --axex-compress-rank 1024 --axex-skip-o `
    -p "the quick brown fox" -n 16 --temp 0 `
    *> docs\figures\paper-a\ncu\k1024_raw.txt

# parse to summary
python scripts\parse_ncu.py

Expected weighted L2 hit rates: 17.0 percent baseline against 16.3 percent at k=1024, a delta of 0.71 percentage points (which is the finding that refutes cache-fit).

5. Outputs to commit

docs/figures/paper-a/ncu/summary.csv
docs/figures/paper-a/ncu/summary.json
docs/figures/paper-a/ncu/baseline_raw.txt
docs/figures/paper-a/ncu/k1024_raw.txt
The whitepaper pack under benchmarks/whitepaper_pack_*.

Tolerances

Throughput: plus or minus 5 percent to absorb GPU clock variance.
Perplexity: deterministic to four decimal places.
L2 hit rate: noise floor about 1 percentage point at the launch count above.

What can go wrong

Skipping the 30-second pre-run idle on a laptop will produce a false 50-60 percent retention reading. Always cool down.
Running with another GPU-bound process resident produces about 3.5 percent throughput contamination on both arms (this was the Paper B MCR contamination, but it applies to A as well).
If ncu emits empty CSV the most likely cause is a stale CUDA driver. Update to 552 or later.

Pre-built W_proj caches (skip calibration)

First-run calibration (~90 s on Ryzen 9 7940HS) can be skipped by downloading a pre-computed ott_wproj_cache_*.bin from the HyperTensor GitHub Releases page. Each release is keyed by (base model, quantisation, rank k) and includes the SHA256, exact base-model commit, and resource requirements. Browse them on the Models & caches tab.

# Headline configuration (k=1536)
gh release download wproj-cache-2405A3B6 \
  --repo NagusameCS/HyperTensor \
  --pattern 'ott_wproj_cache_*.bin'

# Verify checksum (matches the release page)
Get-FileHash ott_wproj_cache_*.bin -Algorithm SHA256

# Place next to the GGUF and run; the runtime auto-detects.
.\build_host\geodessical.exe `
  --model models\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf `
  --axex-compress --axex-attn-only --axex-weight-pca `
  --axex-compress-rank 1536 --tokens 64 --prompt "..."

See docs/data/release_manifest.json for the full machine-readable manifest of available caches and their resource requirements.