Reproduce Paper C: Geodesic speculative decoding, HyperTensor Research

Scope

This guide reproduces the OTT-aware verifier, the EOS-aware acceptance protocol, and the partial T_V(k) sweep. The full T_V(k) sweep must run on EC2: on Llama-3.1-8B, building the cold weight-PCA cache for k at or below 256 exceeds the wall-time budget on the RTX 4070 Laptop.

Hardware target

Local headline acceptance rate: any 8 GB CUDA GPU.
Full T_V(k) sweep: g6e.xlarge (L40S), about 90 minutes wall time.

Prerequisites

Same toolchain as Paper A.
Pre-warmed weight-PCA cache for the rank set you intend to sweep. The cache is built once per (model, k) pair and reused on later runs. A cold PCA at k=256 takes about 15 minutes on the reference GPU.

1. Headline acceptance rate

.\build_host\geodessical.exe $model `
    --axex-spec --axex-spec-tree-k 4 --axex-spec-target-acc 0.4 `
    -p "Write a sorting algorithm in Python" -n 256 --temp 0

Expected: about 38.5 percent acceptance under the EOS-aware schedule.

2. T_V(k) partial sweep

python scripts\bench_tv_of_k.py

The script writes docs/figures/paper-c/tv_of_k.csv. The local run produces only a partial curve: alpha collapses to 0 percent for k at or below 256 because the cold-cache PCA exceeds the local wall-time budget. Section 6 of the paper reports this as a hardware-specific artefact, not as a mechanism finding.
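
When plotting the local curve it can help to separate the cold-cache artefact rows from the usable points. The following is a minimal sketch and is not part of the repository: the column names `k` and `alpha` are assumptions about the CSV layout and should be changed to match the header the script actually writes, and the sample rows are made up for illustration.

```python
import csv
import io

# Ranks at or below this value hit the cold-PCA timeout on the local GPU.
COLD_CACHE_MAX_K = 256


def split_partial_sweep(csv_text, k_col="k", alpha_col="alpha"):
    """Split sweep rows into usable points and cold-cache artefacts.

    k_col and alpha_col are hypothetical column names; adjust them to the
    real header of tv_of_k.csv.
    """
    usable, artefacts = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        k = int(row[k_col])
        alpha = float(row[alpha_col])
        # A zero acceptance rate at a small rank is treated as the artefact.
        if k <= COLD_CACHE_MAX_K and alpha == 0.0:
            artefacts.append((k, alpha))
        else:
            usable.append((k, alpha))
    return usable, artefacts


# Made-up rows for illustration only; these are not measured results.
sample = "k,alpha\n128,0.0\n256,0.0\n512,0.31\n1024,0.36\n"
usable, artefacts = split_partial_sweep(sample)
print("usable:", usable)
print("artefacts:", artefacts)
```

To run it on a real output, pass the text of docs/figures/paper-c/tv_of_k.csv in place of `sample`.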

3. EC2: full T_V(k) sweep

python scripts/bench_tv_of_k.py

On the L40S the cold-PCA cost drops below the timeout, so the k=128 and k=256 points populate. Expected total wall time is about 90 minutes.

Outputs

docs/figures/paper-c/tv_of_k.csv
docs/figures/paper-c/spec_acceptance_pack/

Tolerances

Acceptance rate: plus or minus 1.5 percentage points run-over-run on identical seeds.
T_V(k) shape: monotone decreasing in k for k at or above 512 on Llama-3.1-8B.
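
Both tolerances can be checked mechanically. The sketch below is illustrative and is not a script shipped with the paper: the function names are hypothetical, acceptance rates are assumed to be given in percent, and "monotone decreasing" is read as non-increasing so that ties between neighbouring ranks pass.

```python
def acceptance_within_tolerance(rate_a_pct, rate_b_pct, tol_pp=1.5):
    """True if two acceptance rates (in percent) differ by at most tol_pp points."""
    return abs(rate_a_pct - rate_b_pct) <= tol_pp


def tv_monotone_decreasing(points, k_min=512):
    """True if T_V is non-increasing in k over all points with k >= k_min.

    points is an iterable of (k, t_v) pairs; ranks below k_min are ignored.
    """
    kept = sorted((k, tv) for k, tv in points if k >= k_min)
    return all(
        later <= earlier
        for (_, earlier), (_, later) in zip(kept, kept[1:])
    )
```

For example, `acceptance_within_tolerance(38.5, 39.9)` passes, while a pair of runs at 38.5 and 40.5 percent on identical seeds would fail the check.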