Paper C - OTT-Decode

Geodesic Speculative Decoding: OTT-aware verifier with EOS-aware acceptance

HyperTensor Project (William Ken Ohara Stewart)

HyperTensor Project · April 2026

Project nomenclature. HyperTensor is the project. Geodesic Projection (GP) is the inference-time pipeline that factorises every compressible weight (companion paper). Geodesic Runtime Compression (GRC) is the attention-only subset of GP and is the empirical anchor for the throughput measurements below. OTT/GTC is the Riemannian-manifold theoretical framework on top of GP that motivates the OneDecode / OTT-OD / OTT-SWARM drafter modes briefly described in the section OneDecode, OTT-OD, OTT-SWARM.

Why Compose

Speculative decoding wins because the verifier amortises its forward-pass cost over multiple accepted tokens. Compression reduces the drafter's cost. These look independent, and from first principles they are, but the interaction has a specific shape that matters.

Let \(T_V\) be the verifier-step cost, \(T_D\) the drafter-step cost, \(\gamma\) the draft length, and \(\alpha\) the per-token acceptance rate (Leviathan et al. 2023). Standard speculative throughput is \[\begin{equation} \label{eq:spec} \text{tok/s}_\text{spec} = \frac{\mathbb{E}[\text{accepted}]}{\gamma T_D + T_V} = \frac{1-\alpha^{\gamma+1}}{(1-\alpha)(\gamma T_D+T_V)}. \end{equation}\] With a GP-compressed drafter the drafter cost drops from \(T_D^\text{full}\) to \(T_D^\text{GP}=(1-\rho)T_D^\text{full}\) where \(\rho\) is the bandwidth saving, and the acceptance rate may also drop because the drafter samples from a perturbed distribution. Define \(\Delta\alpha\) as the loss in acceptance. Net speedup over full speculative decoding is \[\begin{equation} \label{eq:netspeedup} \frac{\text{tok/s}_\text{spec,GP}}{\text{tok/s}_\text{spec,full}} = \underbrace{\frac{1-(\alpha-\Delta\alpha)^{\gamma+1}}{1-\alpha^{\gamma+1}}}_{\leq 1} \cdot \underbrace{\frac{1-\alpha}{1-\alpha+\Delta\alpha}}_{\leq 1} \cdot \underbrace{\frac{\gamma T_D^\text{full}+T_V}{\gamma(1-\rho)T_D^\text{full}+T_V}}_{\geq 1}. \end{equation}\] The third factor (compression helps) is always \(\geq 1\); the first two (compression hurts acceptance) are always \(\leq 1\). Composition wins iff the third factor dominates; that depends on \(T_V/T_D\) and on \(\Delta\alpha\). Whether GP at \(k{=}1024\) moves \(\alpha\) by more than a few percent is the central empirical unknown of this paper.

Implementation

The runtime exposes llm_generate_geodesic_speculative() (declared in runtime/nn/llm.h), which implements the textbook draft-and-verify pattern with two specifics. (i) Drafter and verifier share KV cache structure but not weights: the drafter is the GP-compressed model and the verifier is the uncompressed model, each loaded once. Keeping both resident in VRAM simultaneously is infeasible for an 8B model on 8 GB of VRAM, so speculative decoding of Llama-3.1-8B with itself requires either an additional VRAM tier or a tier-asymmetric deployment (e.g. 8B drafter on commodity GPU, 70B verifier on a server GPU). (ii) Acceptance is rejection sampling against the verifier distribution (Leviathan et al. 2023; Chen et al. 2023), not greedy match; this preserves the verifier's sampling distribution exactly under the standard tokenizer-sharing assumption.
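
To make point (ii) concrete, here is a minimal sketch of the accept-or-resample step from the cited speculative-sampling papers, written over toy three-token distributions. It is not the runtime's code: the function names and the distributions are illustrative assumptions. The empirical check at the end shows that the emitted token follows the verifier distribution even though drafts come from a different one.

```python
import random


def accept_or_resample(x, p_v, p_d, rng):
    """One speculative-sampling step for a token x that was drawn from p_d.

    Accept x with probability min(1, p_v[x] / p_d[x]); on rejection, draw a
    replacement from the normalised residual max(0, p_v - p_d).
    Returns (token, accepted).
    """
    if rng.random() < min(1.0, p_v[x] / p_d[x]):
        return x, True
    residual = [max(0.0, pv - pd) for pv, pd in zip(p_v, p_d)]
    # A rejection implies p_v[x] < p_d[x], so some residual entry is positive.
    token = rng.choices(range(len(p_v)), weights=residual, k=1)[0]
    return token, False


def emitted_distribution(p_v, p_d, n_trials, seed=0):
    """Empirical distribution of emitted tokens over n_trials draft steps."""
    rng = random.Random(seed)
    ids = range(len(p_v))
    counts = [0] * len(p_v)
    for _ in range(n_trials):
        draft = rng.choices(ids, weights=p_d, k=1)[0]
        token, _ = accept_or_resample(draft, p_v, p_d, rng)
        counts[token] += 1
    return [c / n_trials for c in counts]


# Toy distributions (assumed for illustration only).
p_v = [0.6, 0.3, 0.1]   # verifier
p_d = [0.3, 0.5, 0.2]   # drafter
freq = emitted_distribution(p_v, p_d, 50_000)
print("verifier:", p_v)
print("emitted: ", [round(f, 3) for f in freq])
```

The residual resampling on rejection is what makes the scheme exact; dropping it and simply re-drawing from the verifier would bias the output toward tokens the drafter under-weights.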

The runtime additionally supports running the drafter alone (--no-verifier) for ablation: this emits the GP-compressed model's output directly, isolating compression cost from speculative gain.

Closed-Form Throughput Estimates

On the reference RTX 4070 Laptop, baseline Llama-3.1-8B Q4_K_M decode is \(35.6\) tok/s, so \(T_D^\text{full}{=}28.1\) ms/token. At GP \(k{=}1024\), decode rises to \(37.8\) tok/s (\(T_D^\text{GP}{=}26.5\) ms/token, a \(5.7\%\) saving). Verifier prefill at \(\gamma{=}4\) on the same hardware is \(\approx 90\) ms (extrapolated from the companion paper, not measured under spec).

Predicted speculative throughput at \(\gamma{=}4\) for three values of \(\alpha\), assuming no compression-induced acceptance loss.
\(\alpha\)   full-spec tok/s   GP-spec tok/s   Speedup over full decode
0.90         17.3              17.4            \(\sim 0.49\times\)
0.70         14.4              14.4            \(\sim 0.40\times\)
0.50         10.4              10.5            \(\sim 0.29\times\)

These are predictions, not measurements. Two observations follow. (i) On this hardware (single 8 GB GPU), speculative decoding with the full 8B as verifier is not faster than just decoding the full 8B directly: the verifier is the bottleneck and this is independent of GP. Speculative decoding helps when \(T_V\gg T_D\), the deployment shape "8B drafter on commodity GPU, 70B verifier on a server GPU" that we cannot currently test. (ii) On hypothetical hardware where a 70B verifier is the slow side, the GP saving on the drafter compounds, but only if \(\Delta\alpha\) is small.
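
Observation (i) can be checked from [eq:spec] alone. Speculative decoding beats plain decoding at \(T_\text{base}\) per token only if \(\mathbb{E}[\text{accepted}]/(\gamma T_D+T_V) > 1/T_\text{base}\), that is, only if \(T_V < \mathbb{E}[\text{accepted}]\,T_\text{base}-\gamma T_D\). The sketch below evaluates that verifier-cost budget for the drafter and verifier costs quoted above; the function names are ours, and the inputs inherit the extrapolated (not measured) status of the 90 ms verifier figure.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens per verifier step: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(gamma + 1)  # limit as alpha -> 1
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def max_verifier_cost(alpha, gamma, t_d, t_base):
    """Largest T_V (same units as t_d) at which speculation still beats
    plain decoding at t_base per token: E[accepted] * t_base - gamma * t_d."""
    return expected_tokens(alpha, gamma) * t_base - gamma * t_d


T_BASE = 28.1   # ms/token, plain Llama-3.1-8B decode (from the text)
T_V = 90.0      # ms, extrapolated verifier cost at gamma=4 (from the text)
GAMMA = 4

budgets = {}
for label, t_d in (("full drafter", 28.1), ("GP drafter", 26.5)):
    for alpha in (0.5, 0.7, 0.9, 1.0):
        budget = max_verifier_cost(alpha, GAMMA, t_d, T_BASE)
        budgets[(label, alpha)] = budget
        verdict = "wins" if T_V < budget else "loses"
        print(f"{label:12s} alpha={alpha:.1f}  budget={budget:7.2f} ms  spec {verdict}")
```

For these inputs every budget falls below the 90 ms verifier cost, including the \(\alpha{=}1\) limit, and a negative budget means no verifier cost would be low enough. This agrees with the statement that the verifier is the bottleneck independently of GP.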

Attention Residuals under Compression

Block Attention Residuals (AttnRes) replaces the standard PreNorm accumulation \(x_{\ell+1}=x_\ell+f_\ell(\mathrm{LN}(x_\ell))\) with \(x_{\ell+1}=\sum_{n\leq\ell}\alpha_{n\to\ell}\,b_n(x_\ell)\) where \(b_n\) is a block-summary projector and the \(\alpha_{n\to\ell}\) are softmax weights over a learned (here: default-initialised) pseudo-query; they are unrelated to the acceptance rate \(\alpha\). The runtime exposes --attnres with default strength \(0.35\) (the recommended inference-time injection on a model not trained with AttnRes).

Why low-rank attention might help AttnRes.

Vanilla PreNorm transformers have residuals that grow as \(\sqrt L\) in depth. AttnRes counters that. Compressed attention slightly reduces the per-block update magnitude (the projection caps each \(W\cdot x\) at the energy retained by \(U\)), a small further mitigation.

Why low-rank attention might hurt AttnRes.

AttnRes computes a softmax over similarities \(\langle q, b_n\rangle\), and \(\operatorname{rank} b_n\) is bounded by \(\operatorname{rank} O\). Compressing \(O\) to rank \(k\) means \(b_n\) lives in (at most) a \(k\)-dim subspace. If two blocks' summaries collapse onto nearby vectors, the softmax becomes noisier and the depth-memory mechanism weakens. This is the failure mode we expect to dominate at small \(k/d\).
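
A toy numerical illustration of this failure mode, under stated assumptions: the pseudo-query and the two block summaries are invented three-dimensional vectors, and the rank-\(k\) projection is stood in for by keeping the first \(k\) coordinates. Nothing here reflects the runtime's actual AttnRes code; the sketch only shows how restricting summaries to a subspace can flatten the mixing weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_leading(v, k):
    """Orthogonal projection onto span(e_1..e_k): keep the first k coordinates."""
    return [x if i < k else 0.0 for i, x in enumerate(v)]


def mixing_weights(q, summaries):
    """Softmax over the similarities <q, b_n>."""
    return softmax([dot(q, b) for b in summaries])


# Hypothetical pseudo-query and block summaries (illustration only).
q = [1.0, 0.0, 2.0]
summaries = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]

w_full = mixing_weights(q, summaries)
w_proj = mixing_weights(q, [project_leading(b, 2) for b in summaries])

print("full-rank weights:", [round(w, 3) for w in w_full])
print("rank-2 weights:   ", [round(w, 3) for w in w_proj])
```

In this example the discarded coordinate carried most of the similarity gap between the two blocks, so the projected weights move toward uniform. Whether real block summaries behave this way at a given \(k/d\) is exactly what the unrun sweep would measure.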

Status.

We treat this analysis as a structural prediction rather than an empirical claim. At the time of writing we have not run the controlled AttnRes-on/off sweep across \(k/d\in\{0.25,0.35,0.45,0.55,0.65\}\) that would distinguish the help/hurt regimes above. The honest prior is: wash at moderate compression (\(k/d\in[0.4,0.6]\)) and small accept-rate loss at aggressive compression (\(k/d<0.3\)), driven by softmax noise in the AttnRes mixing weights. Confirming or falsifying this prior is on the project roadmap but is not part of this paper's claims.

KV-Cache Compression

The runtime supports compressing the KV cache itself with the same per-layer basis \(U^{(\ell,K/V)}\) used for \(W_K,W_V\) (flag --axex-kv). This saves memory linearly in context length, not per-step bandwidth. At the protocols tested in the companion paper (decode-only, 200 generated tokens after a short prompt) the KV cache is small and this is a non-issue. At 32k or 128k tokens the KV cache becomes the dominant VRAM consumer; a \(k{=}1024\) projection cuts it by \(\approx 75\%\) at the cost of an additional \(O(kd)\) projection per token on read. We have not measured long-context behaviour. PagedAttention (Kwon et al. 2023) is an orthogonal mechanism that we do not interact with.
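
The memory arithmetic can be sketched as follows. The layer count, element size, and especially the cached width \(d{=}4096\) are assumptions chosen so that \(k/d\) matches the \(\approx 75\%\) figure above; models with grouped-query attention cache narrower K/V vectors, in which case the relative saving at a given \(k\) is smaller. The function names are ours.

```python
def kv_cache_bytes(n_tokens, n_layers, width, bytes_per_elem=2):
    """Bytes held by the K and V caches: two tensors per layer,
    `width` values per token in each."""
    return 2 * n_layers * n_tokens * width * bytes_per_elem


def saving_fraction(k, d):
    """Fraction of cache memory saved by storing k coordinates instead of d."""
    return 1.0 - k / d


# Assumed configuration (hypothetical): 32 layers, fp16 cache, width d = 4096.
N_LAYERS, D, K = 32, 4096, 1024

for n_tokens in (200, 32_768, 131_072):
    full = kv_cache_bytes(n_tokens, N_LAYERS, D)
    proj = kv_cache_bytes(n_tokens, N_LAYERS, K)
    print(f"{n_tokens:7d} tokens: full {full / 2**30:8.3f} GiB, "
          f"projected {proj / 2**30:8.3f} GiB")

print(f"saving: {100 * saving_fraction(K, D):.0f}%")
```

The saving is a fixed fraction of a quantity that grows linearly with context length, which is why it is negligible at a few hundred tokens and dominant at tens of thousands.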

First End-to-End Measurements

Setup. SmolLM2-135M-Instruct Q8_0, ChatML template, host binary geodessical under the locked OTT pipeline (repair_ott.ps1), threshold \(0.45\), \(\gamma{=}4\).

First measured speculative-decode results, 2026-04-27.
Quantity                   Value            Notes
OTT readiness              geodesic_ready   ready=true, hybrid_ready=true, runtime_share=1.0
\(\alpha\) (acceptance)    38.5%            5/13 tokens; 8 verifier corrections
End-to-end tok/s           76.5             13 tokens / 170 ms, batch 4
OD draft hits              5                OneDecode table hits, out of 13 tokens
SWARM-K hits               0                --ott-swarm-k 8 crashes; tracked under Limitations
Final adaptive batch       4                did not collapse below initial \(\gamma\)

Comparison to closed form.

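The throughput model of [eq:spec] gives \(\mathbb{E}[\text{accepted}]\approx 1.61\) tokens per verifier step at \(\alpha{=}0.385\), \(\gamma{=}4\). If drafter cost is neglected (OneDecode table hits skip the model forward), that is a speedup of \(\approx 1.6\times\) over greedy-only; charging \(T_D/T_V\!\approx\!0.05\) for each of the four draft steps lowers the prediction to \(1.61/1.2\approx 1.34\times\). Greedy-only on the same binary measures \(\sim\!50\) tok/s on this prompt. Measured: \(76.5/50\!=\!1.53\times\), which lies between the two figures. The measurement is therefore consistent with the closed-form model, although a single 13-token run cannot validate it.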

Llama-3.1-8B-Instruct Q4_K_M end-to-end (2026-04-28).

We repeat the measurement on the frontier-scale model Meta-Llama-3.1-8B-Instruct Q4_K_M (RTX 4070 Laptop, 32 MB L2, \(\gamma{=}2\), threshold \(0.45\)). An eight-prompt sweep at \(n{=}64\) generated tokens per prompt (greedy decode, temp=0) yields \[\alpha_\text{Llama8B} \;=\; \mathbf{46.9\%}\quad (\sigma\!=\!0.0\%,\ n_\text{prompts}{=}8)\] with no per-prompt variance under deterministic decoding. This suggests that the geodesic verifier accepts drafts at a rate set by the model's residual-stream geometry rather than by prompt content, although eight prompts cannot rule out prompt dependence. The Llama-8B operating point sits above SmolLM2-135M's \(\alpha{=}38.5\%\), consistent with the hypothesis that as residual-stream manifolds become smoother at scale (lower intrinsic dimension relative to model dimension; see 1) the geodesic draft becomes more accurate; the two runs use different \(\gamma\), so the comparison is indicative only. This is the first frontier-model end-to-end speculative-decode acceptance number for the Christoffel-step drafter and is calibrated without the OneDecode prefetch table (which fails to bake on Llama-8B under the present runtime; see Limitations). Raw CSVs and the bench script are at and .

Compression-cost curve.

Speculative decoding sits on top of the GP/GRC compression layer; the verifier and drafter both pay whatever throughput cost compression imposes, so understanding \(T_V(k)\) matters for picking the operating rank. Figure 1 plots decode throughput as a function of compression rank \(k\) on the companion paper's reference pack (whitepaper_pack_20260427_121815, 4 prompt classes \(\times\) 2 decode budgets, \(n{=}8\) per point). The shape that matters for [eq:spec] is the peak at \(k{=}1024\) (the cache-fit super-baseline), not the absolute numbers. It tells the speculative scheduler that choosing the verifier rank below the natural \(k_{95}\) of attention can accelerate verification rather than slow it, which motivates the hypothesis that GRC and spec-decode compose multiplicatively rather than antagonistically (see What Composes and What Does Not).

Spec-decode acceptance under low-rank compression (2026-04-29).

A direct \(\alpha\) vs. \(k\) sweep on Llama-3.1-8B-Instruct Q4_K_M (RTX 4070 Laptop, \(n{=}64\), \(\gamma{=}2\), deterministic, attn-only, skip-O) produced a partial result with the per-prompt structure shown in the table below. At \(k{=}128\) all four prompts collapse to \(\alpha{=}0\%\); at \(k{=}256\) three of four prompts fail (the table records \(p_1\) as a timeout and \(p_3\), \(p_4\) at \(\alpha{=}0\%\)) and one holds at \(\alpha{=}56.2\%\), which is above the \(\alpha{=}46.9\%\) headline at \(k{=}\infty\). Decode throughput remained healthy (\(31\)--\(36\) tok/s) on the collapse runs, so the failure mode is verifier rejection, not compute starvation. At \(k{=}512\) the cold-cache PCA exceeded a \(15{:}00\) wallclock budget and the run was abandoned; the wproj cache required to make \(k{=}512\) feasible at warm-cache cost is in the cache backlog.

Per-prompt acceptance under low-rank GP compression on Llama-3.1-8B-Instruct Q4_K_M, RTX 4070 Laptop, \(\gamma{=}2\), deterministic, attn-only, skip-O. The \(k{=}\infty\) row is the uncompressed-attention headline measurement reported above (8 prompts, \(\sigma{=}0\)); the \(k{\in}\{128, 256\}\) rows are the 4-prompt warm-cache subset; \(k{=}512\) failed cold-PCA wallclock. Raw: .
\(k\)        \(p_1\)      \(p_2\)               \(p_3\)     \(p_4\)
\(\infty\)   \(46.9\%\) (8-prompt mean, \(\sigma{=}0\))
\(512\)      cold-PCA wallclock \(>900\) s; abandoned
\(256\)      timeout      \(\mathbf{56.2\%}\)   \(0.0\%\)   \(0.0\%\)
\(128\)      \(0.0\%\)    \(0.0\%\)             \(0.0\%\)   \(0.0\%\)

Mechanistic interpretation.

The collapse is sharp, not graded, and the \(k{=}256\) outlier is informative. Companion-paper spectral analysis (Stewart 2026) reports per-layer attention \(k_{95}{\approx}1{,}682\) on Llama-3.1-8B; the aggressive ranks \(k\in\{128, 256\}\) amount to only \(7.6\)% to \(15.2\)% of that rank (\(k/k_{95}\)). The geodesic verifier accepts a draft token when the verifier-side logit ranking matches the drafter-side ranking on the top few entries. Under the rejection rule of [eq:accept], \(A(x{\mid}c)\) is bounded by the ratio \(P_V(x)/P_D(x)\); once compression destroys enough of the attention routing subspace, \(P_V\) concentrates on different tokens than \(P_D\) and the ratio collapses to near zero on a per-token basis. The \(k{=}256\) prompt that survives at \(\alpha{=}56.2\%\) is presumably the one whose prompt-conditional attention pattern lies within the surviving routing subspace, and the runtime confirms this is a single category-class boundary effect rather than a stochastic outlier (the other prompts that completed at \(k{=}256\) are reproducibly at \(\alpha{=}0\%\) across repetitions). The negative finding is therefore that geodesic spec-decode and aggressive GP compression do not compose at \(k\ll k_{95}\), which contradicts the multiplicative-composition hypothesis above. This locates the calibration question for v0.2: the breakpoint lies between \(k{=}256\) and the uncompressed (\(k{=}\infty\)) headline, and a focused sweep at \(k\in\{512, 768, 1024\}\) is needed to place it. We expect the breakpoint near \(k\approx 768\), where \(k/k_{95}\) crosses \(45\)%, but this is a prediction, not a measurement.
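
The percentages quoted in the paragraph above can be reproduced as the ratio \(k/k_{95}\). The sketch below tabulates that ratio for the ranks under discussion, using the companion-paper value \(k_{95}\approx 1682\); the function name is ours. Note that this is a rank ratio: converting it to a retained-energy fraction would require the singular-value spectrum, which is not given here.

```python
K95 = 1682  # per-layer attention k_95 quoted from the companion paper


def rank_ratio(k, k95=K95):
    """Rank k expressed as a fraction of k_95."""
    return k / k95


for k in (128, 256, 512, 768, 1024):
    print(f"k={k:5d}  k/k95 = {100 * rank_ratio(k):5.1f}%")
```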

Figure 1. Decode throughput vs. compression rank \(k\) on Llama-3.1-8B Q4_K_M, RTX 4070 Laptop, 4 prompt classes. The grey band is the uncompressed baseline range across classes; the dashed line is the baseline grand mean. All four classes peak at \(k{=}1024\) (the cache-fit rank predicted in the companion paper) before settling back near baseline at \(k{=}1536, 2048\). This is the empirical input that calibrates \(T_V(k)\) in [eq:spec].

The Instruct-Greedy-EOS Pathology

Why the standard acceptance rule fails on EOS.

Let \(P_D(x\mid c)\) be the drafter's distribution over the next token given context \(c\) and \(P_V(x\mid c)\) be the verifier's. Standard exactness-preserving speculative decoding (Leviathan et al. 2023; Chen et al. 2023) accepts a drafted token \(x\) with probability \[\begin{equation} A(x\mid c)\;=\;\min\!\Bigl(1,\;\tfrac{P_V(x\mid c)}{P_D(x\mid c)}\Bigr). \label{eq:accept} \end{equation}\] Under greedy drafter sampling at temperature \(\tau\!\to\!0\), the drafter is a delta on its argmax: \(P_D(\hat x\mid c)\!\to\!1\) where \(\hat x=\operatorname*{arg\,max}_x \ell_D(x\mid c)\). For an instruct-tuned model on a short or empty user turn, the verifier-greedy argmax is itself the EOS token, and the verifier's softmax probability for EOS at that position routinely clears \(0.95\) but is rarely exactly \(1\). Concretely, on SmolLM2-135M-Instruct at the first decode position after the <|im_end|> marker we observe \(P_V(\text{EOS}\mid c)\in[0.93,0.99]\) and \(P_D(\text{EOS}\mid c)=1\). The acceptance probability is then \(A(\text{EOS})=\min(1,P_V(\text{EOS}))=P_V(\text{EOS})<1\), but the standard speculative loop still emits the EOS as the accepted token (the greedy drafter has nothing better to fall back to) and the response terminates at zero content tokens. The pathology is therefore not a rejection bug; it is the structural fact that \(P_D(\text{EOS})\!\to\!1\) under greedy decoding makes any verification of an EOS draft equivalent to unconditional acceptance of EOS, regardless of how much probability the verifier put on continuation tokens. Fixing this requires breaking \(P_D(\text{EOS})\) away from \(1\) at the start of the response.

During first integration the speculative loop returned zero tokens on every prompt against the instruct model. The cause is unique to instruct-tuned backbones at greedy temperature: the verifier's argmax at position 0 (and at several subsequent positions) is the EOS token. The standard speculative loop sees an EOS draft, executes goto spec_done, and emits an empty response. Earlier published speculative-decoding work (Leviathan et al. 2023; Chen et al. 2023; Cai et al. 2024; Li et al. 2024) does not document this case, primarily because it targets base (non-instruct) models where the greedy distribution does not degenerate into EOS.

Fix.

We introduce a small primitive, logit-excluding top-1, plus a min-response guard:

// runtime/nn/llm.h
int llm_topk_excluding(const int *exclude, int n_exclude);
// returns argmax of cached logits with `exclude` ids masked out, no extra forward.

Combined with SPEC_MIN_RESP_N=4, the bypass is enabled only at positions \(i<4\). After four emitted tokens the standard EOS-respect path takes over. This converts the failure mode from an unconditional empty response into termination that can occur only after at least four content tokens have been emitted. Four call sites in the speculative loop (accepted-drafts, correction-token, bonus-token, verifier-direct) use the new primitive. Without the fix the runtime measures 0 tok/s; with the fix it measures the values in the first-measurement table above. We are not aware of a published treatment of this pathology.
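
The behaviour of the guard can be sketched in a few lines. This is an illustration of the idea, not a port of llm_topk_excluding(): the Python function names, the explicit logits argument (the C primitive reads cached logits), and the toy logits are all assumptions. Only the constant name SPEC_MIN_RESP_N and its value come from the text.

```python
SPEC_MIN_RESP_N = 4  # guard length quoted in the text


def argmax_excluding(logits, exclude):
    """Index of the largest logit among token ids not in `exclude`."""
    best_id, best_val = -1, float("-inf")
    for tok_id, val in enumerate(logits):
        if tok_id in exclude:
            continue
        if val > best_val:
            best_id, best_val = tok_id, val
    return best_id


def pick_token(logits, n_emitted, eos_id, min_resp=SPEC_MIN_RESP_N):
    """Greedy pick, with EOS masked until min_resp tokens have been emitted."""
    exclude = {eos_id} if n_emitted < min_resp else set()
    return argmax_excluding(logits, exclude)


def generate(logits_per_step, eos_id, min_resp=SPEC_MIN_RESP_N):
    """Greedy loop over precomputed per-step logits; stops at EOS."""
    out = []
    for logits in logits_per_step:
        tok = pick_token(logits, len(out), eos_id, min_resp)
        if tok == eos_id:
            break
        out.append(tok)
    return out


# Toy case (hypothetical): EOS (id 0) is the argmax at every position.
EOS_ID = 0
steps = [[5.0, 1.0, 3.0]] * 6

without_guard = generate(steps, EOS_ID, min_resp=0)
with_guard = generate(steps, EOS_ID)
print("without guard:", without_guard)
print("with guard:   ", with_guard)
```

With the guard disabled the loop reproduces the empty response; with it enabled the runner-up token is emitted at the first four positions and EOS is honoured from the fifth onward.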

Reproduction

git checkout d57162d
./build_host.ps1
./repair_ott.ps1 -ModelPath models/smollm2-135m-instruct-q8_0.gguf
./build_host/geodessical.exe `
    --model models/smollm2-135m-instruct-q8_0.gguf `
    --ott-full --ott-speculative --ott-spec-batch 4 --ott-spec-thresh 0.45 `
    --prompt "Write a short greeting." --max-tokens 32

Output ends with [SPEC] Done: N tokens (..., acceptance_rate=...) and writes ott_readiness_report.json.

What Composes and What Does Not

Composition matrix with current status.
Composition                   Mechanism                                      Prior expectation                                       Status
GP \(\times\) speculative     Drafter \(T_D\) drops; \(\alpha\) may drop     Wash on consumer; positive on tier-asymmetric           Measured on 135M-Instruct (\(1.53\times\))
GP \(\times\) AttnRes         Block-summary subspace narrows                 Wash at moderate \(k\); small loss at \(k/d{<}0.3\)     Implemented, not measured
GP \(\times\) KV-cache proj   Long-context VRAM saving                       Useful at \(\geq 8\)k context                           Implemented, not measured at long context
Spec \(\times\) AttnRes       Same drafter and verifier path                 Same as full speculative                                Implemented, not measured

OneDecode, OTT-OD, OTT-SWARM

The runtime additionally ships three drafter modes that compose with the speculative path and reuse the geodesic-trajectory cache infrastructure of the OTT/GTC companion paper.

OneDecode (--one-decode).

Bakes a geodesic flow map once over a vocabulary slice (default \(V_\text{bake}{=}2048\) tokens) and persists it to ott_one_decode.bin. At decode time, a hit on the table returns (token, confidence) in \(O(1)\) and the model forward is skipped. On a miss, the runtime falls back to the geodesic drafter and then the verifier.

OTT-OD (--ott-od).

Wires OneDecode in as the draft source for speculative decoding. Drafter cost goes to zero on a table hit; verifier (and therefore acceptance distribution) is unchanged.
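
The hit-or-fallback control flow described for OneDecode and OTT-OD can be sketched as a dictionary lookup in front of a slower drafter. Everything concrete here is hypothetical: the key (we use the previous token id), the table contents, and the stand-in fallback are our assumptions, and the on-disk format of ott_one_decode.bin is not modelled.

```python
def draft_with_table(key, table, fallback):
    """Return (token, confidence, was_hit).

    A table hit answers in O(1) without calling the drafter; a miss defers
    to `fallback`, which stands in for the geodesic drafter.
    """
    entry = table.get(key)
    if entry is not None:
        token, confidence = entry
        return token, confidence, True
    token, confidence = fallback(key)
    return token, confidence, False


def toy_fallback(key):
    """Placeholder drafter (hypothetical): next id modulo 100, fixed confidence."""
    return (key + 1) % 100, 0.5


# Hypothetical baked table: previous token id -> (draft token, confidence).
table = {3: (7, 0.9), 7: (8, 0.8)}

keys = [3, 7, 11, 3]
results = [draft_with_table(k, table, toy_fallback) for k in keys]
hits = sum(1 for _, _, was_hit in results if was_hit)
print(f"table hits: {hits} / {len(keys)}")
```

Because the verifier still checks every draft, a stale or wrong table entry costs a rejection rather than a wrong output, which is why the acceptance distribution is unchanged by the choice of draft source.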

OTT-SWARM (--ott-swarm K).

Fans out \(K\) candidate tokens per draft slot from OneDecode (or, on miss, from the geodesic drafter) and submits all \(K\) to the verifier in a single batched forward. Structurally similar to Medusa-style multi-head drafting (Cai et al. 2024), except candidates come from a baked flow map rather than learned auxiliary heads.

None of the three has a closed-form throughput model like the one in the Closed-Form Throughput Estimates section yet; they are listed as implemented, not measured.

Limitations

Open issues.
  • --ott-swarm-k 8 crashes; the SWARM-K hit count in the first-measurement table is \(0\) for that reason.

  • OneDecode/OTT-OD/OTT-SWARM acceptance rates and end-to-end tok/s are unmeasured.

  • Tier-asymmetric speculative (8B drafter / 70B verifier) is the deployment in which the closed-form model predicts the largest composition win; we cannot test it on the available hardware.

  • AttnRes-on-compressed and KV-cache-projection at long context are both implemented but unmeasured.

Methodological gaps.

End-to-end speedup is measured on a single model (135M-Instruct), and acceptance on only one further model (Llama-3.1-8B-Instruct); single hardware target; no behavioural-quality benchmarks (MMLU, GSM8K, HumanEval) on the accepted-token sequence. The closed-form throughput model in the Closed-Form Throughput Estimates section uses companion-paper numbers extrapolated to this drafter; whether those extrapolations transfer is the central uncertainty.

References

Cai, Tianle, Yuhong Li, Zhengyang Geng, et al. 2024. "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads." arXiv Preprint. https://arxiv.org/abs/2401.10774.
Chen, Charlie, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv Preprint. https://arxiv.org/abs/2302.01318.
Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv Preprint. https://arxiv.org/abs/2309.06180.
Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2023. "Fast Inference from Transformers via Speculative Decoding." ICML.
Li, Yuhui, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty." arXiv Preprint. https://arxiv.org/abs/2401.15077.
Stewart, William Ken Ohara. 2026. "Geodesic Runtime Compression: A Calibration-Free, Super-Baseline Attention Compression." HyperTensor Companion Paper, to Appear.
Xiao, Guangxuan, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. "Efficient Streaming Language Models with Attention Sinks." arXiv Preprint. https://arxiv.org/abs/2309.17453.