Reproduce A · GRC

Reproduce Paper A: Geodesic Runtime Compression

William Ken Ohara Stewart (NagusameCS Independent Research)

HyperTensor Project · April 2026 · Paper A (HTML) · Paper A (PDF) · repro tree

Scope

This guide reproduces the headline 106.27 percent decode-throughput result and the Nsight Compute L2 trace that refutes the cache-fit hypothesis. Everything runs from the repository root on a Windows host with an NVIDIA GPU.

Hardware target

Prerequisites

1. Build the runtime

cd <repo_root>
.\build_host.ps1
# expect: build_host\geodessical.exe

2. Set model path and verify a baseline run

$model = "C:\path\to\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
.\build_host\geodessical.exe $model -p "Hello, world" -n 32
# expect: a [GD] line reporting tok/s. Reference GPU is around 35-36 tok/s.

3. Headline 106.27 percent measurement

Apply the calibration-free attention compression at rank 1024 and compare against the same baseline. The 30-second pre-run idle is mandatory to avoid the thermal-throttling false negative reported in the paper.

Start-Sleep -Seconds 30
# baseline (8 conditions, 12 reps each)
.\scripts\benchmark_whitepaper_finalize.ps1 -CooldownSec 30

Expected ranges on the reference GPU:

4. NCU L2 trace

Reproduces table 4 of the paper. Profiles 200 representative decode launches per kernel under both configurations.

$metrics = "lts__t_sectors_op_read.sum,lts__t_sector_hit_rate.pct,dram__bytes_read.sum,sm__warps_active.avg.pct_of_peak_sustained_active"
New-Item -ItemType Directory -Force docs\figures\paper-a\ncu | Out-Null

# baseline trace
ncu --csv --metrics $metrics --target-processes all `
    --launch-count 200 --launch-skip 50 --print-summary per-kernel `
    .\build_host\geodessical.exe $model -p "the quick brown fox" -n 16 --temp 0 `
    *> docs\figures\paper-a\ncu\baseline_raw.txt

# compressed trace
ncu --csv --metrics $metrics --target-processes all `
    --launch-count 200 --launch-skip 50 --print-summary per-kernel `
    .\build_host\geodessical.exe $model `
    --axex-compress --axex-compress-rank 1024 --axex-skip-o `
    -p "the quick brown fox" -n 16 --temp 0 `
    *> docs\figures\paper-a\ncu\k1024_raw.txt

# parse to summary
python scripts\parse_ncu.py

Expected weighted L2 hit rates: 17.0 percent baseline against 16.3 percent at k=1024, a delta of 0.71 percentage points (which is the finding that refutes cache-fit).

5. Outputs to commit

Tolerances

What can go wrong

Pre-built W_proj caches (skip calibration)

First-run calibration (~90 s on Ryzen 9 7940HS) can be skipped by downloading a pre-computed ott_wproj_cache_*.bin from the HyperTensor GitHub Releases page. Each release is keyed by (base model, quantisation, rank k) and includes the SHA256, exact base-model commit, and resource requirements. Browse them on the Models & caches tab.

# Headline configuration (k=1536)
gh release download wproj-cache-2405A3B6 \
  --repo NagusameCS/HyperTensor \
  --pattern 'ott_wproj_cache_*.bin'

# Verify checksum (matches the release page)
Get-FileHash ott_wproj_cache_*.bin -Algorithm SHA256

# Place next to the GGUF and run; the runtime auto-detects.
.\build_host\geodessical.exe `
  --model models\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf `
  --axex-compress --axex-attn-only --axex-weight-pca `
  --axex-compress-rank 1536 --tokens 64 --prompt "..."

See docs/data/release_manifest.json for the full machine-readable manifest of available caches and their resource requirements.