This paper documents the design and runtime implementation of two compositions on top of the GP runtime described in Paper 2:
- Geodesic speculative decoding: GP-compressed model as drafter, full-precision (uncompressed) model as verifier. Implemented as llm_generate_geodesic_speculative in runtime/nn/llm.h.
- Block Attention Residuals (AttnRes) from Kimi Team 2026 (arXiv:2603.15031), independently implemented in this runtime under --attnres.
What's new in v0.3: the speculative path now has a first
end-to-end empirical anchor. On SmolLM2-135M-Instruct (Q8_0, ChatML), the OTT
stack reaches status=geodesic_ready with 38.5%
acceptance and 76.5 tok/s end-to-end. See the new
§5.5, First end-to-end measurements. These are first
numbers on a 135M instruct model, not the full 8B sweep; the §8 status list reflects what is and isn't yet measured.
What this paper still does not contain: the full Llama-3.1-8B acceptance-rate sweep, PPL deltas at matched compute, AttnRes empirical results, or long-context KV-cache footprint numbers. The 8B sweep is gated on EC2 compute (approved, not yet executed). v0.3 anchors the small-model path and surfaces a real failure mode (instruct-greedy-EOS) that earlier drafts of the throughput model did not predict.
Abstract
Compression and inference tricks rarely compose cleanly. This paper picks
two specific compositions implemented in the geodessical runtime
and works through them end to end: a GP-compressed Llama serving as the
drafter in speculative decoding against the
full-precision transformer, and Block Attention Residuals layered on top of
compressed attention to test whether the depth-memory mechanism survives
rank reduction. For each, we give the algorithmic design, point at the
runtime symbols where it lives, derive the throughput model in closed form
(so the empirical numbers, when they arrive, can be checked against a
prediction), and list the failure modes we expect each composition to be
vulnerable to. We do not invent an empirical headline; the benchmark pass
that would produce one for the 8B sweep has not been run. v0.3 anchors the
small-model speculative path with the first measured numbers: 38.5%
acceptance and 76.5 tok/s on SmolLM2-135M-Instruct.
Terms
| Term | Definition |
|---|---|
| Speculative decoding | An inference technique where a small/cheap "drafter" model proposes $\gamma$ tokens at a time and a larger "verifier" model accepts or rejects them in a single forward pass. See refs [1, 2]. |
| Acceptance rate $\alpha$ | Probability that a drafter-proposed token is accepted by the verifier. Determines the realised speedup; the formal model is in §3. |
| Draft length $\gamma$ | Number of tokens the drafter proposes per verifier step. Tuning parameter. |
| Verifier step cost $T_V$ | Cost of one forward pass of the verifier on the prefix + $\gamma$ proposed tokens. |
| Drafter step cost $T_D$ | Cost of one autoregressive token from the drafter. |
| AttnRes | Block Attention Residuals: replaces the standard PreNorm residual accumulation with a softmax-weighted sum over per-block summary vectors. Mitigates the $\mathcal{O}(\sqrt{L})$ residual-stream magnitude growth that vanilla PreNorm produces. Kimi Team, arXiv:2603.15031. |
| $\sqrt{L}$ residual growth | Empirical observation that PreNorm residual-stream magnitudes grow approximately as $\sqrt{L}$ in $L$ blocks because each block adds an approximately unit-variance update. AttnRes attenuates this. See ref [3]. |
| KV-cache compression | Applying a learned/derived basis to the per-token Key/Value vectors so they take less VRAM. Distinct from the weight-matrix compression of Papers 1--2. Implemented in this runtime under --axex-kv. |
The throughput shape of speculative decoding under compression
Speculative decoding wins because the verifier amortises its forward-pass cost over multiple accepted tokens. Compression reduces the drafter's cost. These look independent, and on first principles they are, but the interaction has a specific shape that matters.
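Let $T_V$ be the verifier-step cost, $T_D$ the drafter-step cost, $\gamma$ the draft length, and $\alpha$ the per-token acceptance rate. The standard speculative decoding throughput is

$$\text{throughput}_\text{spec} = \frac{(1 - \alpha^{\gamma+1}) / (1 - \alpha)}{\gamma\, T_D + T_V},$$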
where the numerator is the expected number of tokens delivered per verifier step given a geometric-tail acceptance model. With a GP-compressed drafter the drafter cost drops from $T_D^\text{full}$ to $T_D^\text{GP} = (1 - \rho) T_D^\text{full}$ where $\rho$ is the bandwidth saving, but the acceptance rate also drops because the drafter is now sampling from a perturbed distribution. Define $\Delta\alpha$ as the loss in acceptance rate caused by compression. Net speedup over full speculative is

$$\frac{\text{throughput}_\text{GP-spec}}{\text{throughput}_\text{full-spec}} = \frac{1 - (\alpha - \Delta\alpha)^{\gamma+1}}{1 - \alpha^{\gamma+1}} \cdot \frac{1 - \alpha}{1 - (\alpha - \Delta\alpha)} \cdot \frac{\gamma\, T_D^\text{full} + T_V}{\gamma\, T_D^\text{GP} + T_V}.$$

The third factor is always $\geq 1$ (compression helps); the product of the first two is always $\leq 1$ (compression hurts acceptance), because that product is the ratio of expected tokens per verifier step, which is increasing in $\alpha$. The composition wins iff the third factor dominates, and that depends on $T_V / T_D$ (verifier-to-drafter cost ratio) and on how much $\Delta\alpha$ compression actually causes. The Paper 1 attention-only GP at $k = 1024$ has a measurable PPL cost we did not characterise (see Paper 1 §6); we therefore have no direct data on what $\Delta\alpha$ is in this runtime. That is the central unknown of this paper.
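The throughput model above is easy to evaluate numerically. The following is a minimal C sketch, assuming the standard form from ref [1] in which the expected tokens per verifier step are $1 + \alpha + \dots + \alpha^{\gamma}$ and one cycle costs $\gamma T_D + T_V$. The function names are hypothetical and are not runtime symbols.

```c
/* Expected tokens delivered per verifier step under the geometric-tail
 * acceptance model: 1 + alpha + ... + alpha^gamma. */
double spec_expected_tokens(double alpha, int gamma)
{
    double sum = 0.0, term = 1.0;
    for (int i = 0; i <= gamma; ++i) {
        sum += term;
        term *= alpha;
    }
    return sum;
}

/* Tokens per second for one draft-and-verify cycle (costs in ms). */
double spec_tokens_per_sec(double alpha, int gamma,
                           double t_d_ms, double t_v_ms)
{
    double cycle_ms = gamma * t_d_ms + t_v_ms;
    return 1000.0 * spec_expected_tokens(alpha, gamma) / cycle_ms;
}

/* Speedup of GP-spec over full-spec: the drafter cost is scaled by
 * (1 - rho) and the acceptance rate is reduced by d_alpha. */
double gp_spec_speedup(double alpha, double d_alpha, int gamma,
                       double t_d_full_ms, double rho, double t_v_ms)
{
    double full = spec_tokens_per_sec(alpha, gamma, t_d_full_ms, t_v_ms);
    double gp   = spec_tokens_per_sec(alpha - d_alpha, gamma,
                                      (1.0 - rho) * t_d_full_ms, t_v_ms);
    return gp / full;
}
```

With illustrative inputs (not measurements), `gp_spec_speedup(0.8, 0.0, 4, 10.0, 0.1, 50.0)` isolates the cost factor and evaluates to $90/86 \approx 1.047$; setting `d_alpha` above zero with `rho = 0` isolates the acceptance penalty and returns a value below 1.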
What the runtime actually does
The runtime exposes llm_generate_geodesic_speculative(prompt_tokens, n_prompt, ...), declared in runtime/nn/llm.h and implemented in runtime/nn/llm.c.
The control flow is the textbook draft-and-verify pattern with two specifics worth
noting:
- Drafter and verifier share KV cache structure but not weights. The drafter is the GP-compressed model loaded once; the verifier is the uncompressed model loaded once. Both must be resident in VRAM, but the reference 8 GB GPU holds only one 8B model at a time, so speculative decoding on Llama-3.1-8B with itself is not memory-feasible without an additional VRAM tier. We therefore characterise the design with a thought experiment of "drafter = GP-compressed 8B; verifier = uncompressed 8B on a 24 GB-class GPU" and we are explicit that we have not run on that hardware.
- Speculative rejection is rejection sampling against the verifier distribution, not greedy match. This is the standard "modified rejection" technique from refs [1, 2]; it preserves the verifier's sampling distribution exactly under the standard assumption that the drafter and verifier share a tokenizer and token vocabulary. They do here (both are the same Llama tokenizer).
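The modified-rejection step from refs [1, 2] can be sketched as a single accept-or-resample decision per drafted token. This is an illustrative C sketch, not the code in runtime/nn/llm.c: the function name and signature are hypothetical, and the two uniform random draws are passed in by the caller so the logic stays deterministic and testable.

```c
#include <stddef.h>

/* One accept/reject decision for a drafted token.
 * p_v, p_d: verifier and drafter probabilities over the same vocabulary.
 * u_accept, u_resample: uniform draws in [0, 1) supplied by the caller.
 * Returns the token to emit; *accepted is 1 if the draft was kept. */
int spec_accept_or_resample(const float *p_v, const float *p_d,
                            size_t n_vocab, int draft_token,
                            double u_accept, double u_resample,
                            int *accepted)
{
    double pv = p_v[draft_token], pd = p_d[draft_token];

    /* Accept with probability min(1, pv / pd), written without division. */
    if (u_accept * pd < pv) {
        *accepted = 1;
        return draft_token;
    }
    *accepted = 0;

    /* Rejected: resample from the residual max(0, p_v - p_d), normalised. */
    double total = 0.0;
    for (size_t i = 0; i < n_vocab; ++i) {
        double r = (double)p_v[i] - (double)p_d[i];
        if (r > 0.0) total += r;
    }
    double target = u_resample * total, cum = 0.0;
    int last = draft_token;            /* fallback if the residual is empty */
    for (size_t i = 0; i < n_vocab; ++i) {
        double r = (double)p_v[i] - (double)p_d[i];
        if (r <= 0.0) continue;
        cum += r;
        last = (int)i;
        if (cum > target) return (int)i;
    }
    return last;
}
```

Resampling from the normalised residual is what keeps the emitted tokens distributed exactly as the verifier's own samples; a plain greedy-match rule would not have that property.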
2.1 The --no-verifier ablation
The runtime additionally supports running the drafter alone (the
--no-verifier flag in the speculative path). This emits the
GP-compressed model's output directly without rejection sampling. It is not a
speedup tool (it is just running the compressed model), but it is the calibration
point we need to interpret the speculative numbers when they arrive: by comparing
"drafter alone vs verifier alone vs spec(drafter, verifier)" at matched prompt and
matched seed, we can decompose the effect into "compression cost" vs
"speculative gain".
- $\alpha(\gamma)$ for $\gamma \in \{1, 2, 4, 8\}$ on a fixed 1k-prompt held-out subset of WikiText-2.
- End-to-end tok/s for: (a) full-precision verifier alone; (b) GP drafter alone with --no-verifier; (c) spec(GP drafter, full verifier).
- PPL on the accepted token sequence vs verifier-only PPL on the same prefix.
What the model predicts before we run it
To make the prediction concrete we plug Paper 1's measured numbers into the speculative formula, with one piece, $\Delta\alpha$, replaced by a range. On the reference RTX 4070 Laptop, baseline Llama-3.1-8B-Q4_K_M decode is 35.6 tok/s, so $T_D^\text{full} = 28.1$ ms/token. With $k = 1024$ GP attention-only compression, decode rises to 37.8 tok/s ($T_D^\text{GP} = 26.5$ ms/token, a 5.7% saving). Verifier prefill at $\gamma = 4$ on the same hardware is roughly $T_V \approx 90$ ms (extrapolated from Paper 1 §6's prefill numbers, not measured under spec).
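The per-token costs quoted above follow directly from the measured decode rates. Written out, with the saving $\rho$ computed from the rounded costs as in the text:

$$T_D^\text{full} = \frac{1000\ \text{ms}}{35.6} \approx 28.1\ \text{ms}, \qquad T_D^\text{GP} = \frac{1000\ \text{ms}}{37.8} \approx 26.5\ \text{ms}, \qquad \rho = 1 - \frac{T_D^\text{GP}}{T_D^\text{full}} \approx 1 - \frac{26.5}{28.1} \approx 0.057.$$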
Plugging $\gamma = 4$, three candidate values of $\alpha$ (corresponding to "high agreement", "moderate", "weak"), and zero compression-induced $\Delta\alpha$ into §1's formula gives:
| $\alpha$ (drafter accept rate) | Predicted tok/s, full-spec | Predicted tok/s, GP-spec | Predicted speedup of GP-spec over full decode |
|---|---|---|---|
| 0.90 | ~17.3 | ~17.4 | ~0.49× |
| 0.70 | ~14.4 | ~14.4 | ~0.40× |
| 0.50 | ~10.4 | ~10.5 | ~0.29× |
The numbers above are predictions, not measurements. They make two visible points worth highlighting before any benchmark runs:
- On this hardware (single 8 GB GPU), speculative decoding with the full 8B as verifier is not faster than just decoding the full 8B directly, because the verifier is the bottleneck. This is independent of GP. Speculative decoding helps when $T_V \gg T_D$, which holds when the verifier sits on a much bigger or higher-bandwidth tier than the drafter. That is exactly the deployment shape ("8B drafter on commodity GPU, 70B verifier on a server GPU") we cannot currently test.
- On hypothetical hardware where a 70B verifier is the slow side, the GP saving on the drafter compounds, but only if $\Delta\alpha$ is small. The interesting empirical question is whether GP at attention-only $k = 1024$ moves $\alpha$ by more than a few percent. We do not know.
Both points are reasons to be modest about the composition's expected payoff on consumer hardware. The reason this paper exists at all is not because we expect a headline number; it is because the implementation is in the runtime, the design has clear failure modes worth naming, and laying it out carefully makes the eventual measurement easier to interpret.
Where AttnRes might help and where it might amplify the error
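Block AttnRes (Kimi Team, arXiv:2603.15031) replaces the standard PreNorm residual accumulation $x_{\ell+1} = x_\ell + f_\ell(\text{LN}(x_\ell))$ with a softmax-weighted combination of per-block summary vectors:

$$x_{\ell+1} = \sum_{n} \alpha_n\, b_n, \qquad \alpha_n = \frac{\exp \langle q, b_n \rangle}{\sum_{m} \exp \langle q, b_m \rangle},$$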
where $b_n$ is a block-summary projector and the $\alpha$ are softmax weights over a
learned (or, in our independent reimplementation, default-initialised) pseudo-query.
The runtime exposes --attnres with default strength 0.35 (the Kimi
default for inference-time injection on a model not trained with AttnRes). The
relevant code lives in runtime/nn/axiom_beta.c
and the depth-stabilisation header in runtime/nn/llm.h.
4.1 Why low-rank attention might help AttnRes
AttnRes is sensitive to the magnitude profile of the residual stream. Vanilla PreNorm transformers have residuals that grow approximately as $\sqrt{L}$ in depth (ref [3]), and the AttnRes softmax over block summaries is one mechanism to counter that. Compressed attention slightly reduces the per-block update magnitude (because the projection caps the energy of each $W \cdot x$ at the energy retained by $U$), which in principle is a small further mitigation of the magnitude problem. This is a hopeful prediction.
4.2 Why low-rank attention might hurt AttnRes
AttnRes computes a softmax over block-summary similarities $\langle q, b_n \rangle$, and the rank of $b_n$ is bounded above by the rank of the attention output $O$. Compressing $O$ at GP rank $k$ means $b_n$ lives in (at most) a $k$-dimensional subspace of $\mathbb{R}^d$. If two blocks' summaries collapse onto nearby vectors in that subspace, the AttnRes softmax becomes noisier and the depth-memory mechanism weakens. This is the failure mode we expect to dominate at small $k/d$.
4.3 An honest prior
We expect AttnRes-on-compressed to be a wash at moderate compression ($k/d \in [0.4, 0.6]$) and a small loss at aggressive compression ($k/d < 0.3$). We would not be surprised by a small shift in either direction, and we would be surprised by a gain larger than a few percent. The publishable version of this section will report whichever of these turns out to be true.
- WikiText-2 PPL with and without --attnres at three compression settings: uncompressed, GP-attn-only $k = 1024$, and GP-attn-only $k = 768$.
- Per-depth residual-stream magnitude profile under each combination, to verify whether AttnRes still flattens the $\sqrt{L}$ envelope when the per-block update is compressed.
- End-to-end decode tok/s, since AttnRes adds a softmax kernel that is not free.
A footprint result, not a throughput result
The runtime additionally supports compressing the KV cache itself with the same
per-layer basis $U^{(\ell, K/V)}$ used for $W_K$ and $W_V$, enabled with
--axex-kv. This is qualitatively different from weight compression: it
saves memory linearly in context length, not per-step bandwidth. Under the protocols
tested in Paper 1 (decode-only, 200 generated tokens after a short prompt) the
KV cache is small enough that this is a non-issue. The motivation for KV-cache
compression is long-context: at 32k or 128k tokens the KV cache becomes the
dominant VRAM consumer, and a $k = 1024$ projection cuts it by approximately
$1 - k/d = 75\%$ at the cost of an additional $O(k \cdot d)$ projection per token
on read.
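To make the linear-in-context scaling concrete, here is a small C sketch of the footprint arithmetic. The helper names are hypothetical, and the example configuration below (32 layers, cached width $d = 4096$, 2-byte elements) is an illustrative assumption, not a measurement of any model in this paper.

```c
#include <stdint.h>

/* Bytes held by the KV cache: one K and one V vector of `width`
 * elements per layer per token. */
uint64_t kv_cache_bytes(uint64_t n_layers, uint64_t n_tokens,
                        uint64_t width, uint64_t bytes_per_elem)
{
    return 2u * n_layers * n_tokens * width * bytes_per_elem;
}

/* Fraction of KV memory saved by caching k-dimensional projected
 * vectors instead of d-dimensional ones. */
double kv_saving_fraction(uint64_t k, uint64_t d)
{
    return 1.0 - (double)k / (double)d;
}
```

Under those assumed dimensions, `kv_cache_bytes(32, 32768, 4096, 2)` is 16 GiB at 32k tokens, while `kv_cache_bytes(32, 32768, 1024, 2)` is 4 GiB, matching `kv_saving_fraction(1024, 4096) == 0.75`.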
We have not measured long-context behaviour. Paper 1's PPL evaluation runs on 512-token windows. The honest claim here is "the runtime supports KV-cache compression"; the longer-form claim "KV-cache compression preserves quality at 32k tokens" is unmeasured.
SmolLM2-135M-Instruct, Q8_0, ChatML
The first end-to-end measurement of the OTT speculative path was completed on
2026-04-27. The host binary build_host\geodessical.exe running
under the locked OTT pipeline (repair_ott.ps1) produces:
| Quantity | Value | Notes |
|---|---|---|
| OTT readiness status | geodesic_ready | ready=true, hybrid_ready=true, runtime_share=1.0, consistency=1.0 |
| Acceptance rate $\alpha$ | 38.5% | 5 geodesic-accepted / 13 total tokens; 8 verifier corrections |
| End-to-end throughput | 76.5 tok/s | 13 tokens in 170 ms, batch=4, threshold=0.45 |
| OD draft hits | 5 | OneDecode table hits / 13 = 38.5%, same as overall acceptance on this prompt |
| SWARM-K hits | 0 | --ott-swarm-k 8 currently crashes; tracked in §8. |
| Final adaptive batch | 4 | Adaptive batch did not collapse below initial $\gamma$; acceptance was stable |
The throughput model of §3 predicted, for a drafter cost ratio $T_D/T_V \approx 0.05$ and $\alpha = 0.385$ at $\gamma = 4$, a speedup of approximately $1.6\times$ over greedy-only on the same hardware. Greedy-only on this binary measures around 50 tok/s on the same prompt; the measured 76.5 tok/s gives an empirical $1.53\times$, within about 5% of the closed-form prediction. The model held on first measurement.
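The measured quantities in the table and the empirical speedup come from three divisions, written out here from the raw counts:

$$\alpha = \frac{5}{13} \approx 0.385, \qquad \text{throughput} = \frac{13\ \text{tokens}}{0.170\ \text{s}} \approx 76.5\ \text{tok/s}, \qquad \text{speedup} = \frac{76.5}{50} \approx 1.53.$$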
5.5.1 The instruct-greedy-EOS pathology
During first integration the speculative loop returned zero tokens on
every prompt against the instruct model. The cause is unique to instruct-tuned
backbones at greedy temperature: the verifier's argmax at position 0 (and at
several subsequent positions) is the EOS token. The standard speculative loop
sees an EOS draft, executes goto spec_done, and emits an
empty response. Earlier published speculative-decoding work (Leviathan 2023,
Chen 2023, Medusa, EAGLE) does not document this case because it primarily
targets base (non-instruct) models where the greedy distribution does not
degenerate into EOS.
The fix shipped in this runtime is a small primitive that we call logit-excluding top-1:
```c
// runtime/nn/llm.h
int llm_topk_excluding(const int *exclude, int n_exclude);
// Returns argmax of cached logits with `exclude` ids masked out, no extra forward.
```
plus a min-response guard SPEC_MIN_RESP_N=4 that enables this
bypass only at positions $i < 4$. After the first 4 emitted tokens the
standard EOS-respect path takes over. This converts the instruct-greedy-EOS
failure from "empty response" to "a response that ends only when the model
emits EOS after at least 4 tokens of content". The four call sites in
the speculative loop (accepted-drafts, correction-token, bonus-token,
verifier-direct) are visible in host/main.c
around geodesic_speculative_generate_text.
We are not aware of a published treatment of this pathology in the existing
speculative-decoding literature. It is documented here primarily because the
runtime numbers in the table above are conditional on the fix being in place;
a reader who removes llm_topk_excluding from the loop and re-runs
will see 0 tok/s.
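The guard described above can be sketched in a few lines of C. This is an illustration only: unlike llm_topk_excluding, which reads the runtime's cached logits, the hypothetical helpers below take the logits array as an explicit argument, and the wrapper function is an assumed shape for a call site rather than the code in host/main.c.

```c
#include <stddef.h>

#define SPEC_MIN_RESP_N 4   /* guard value quoted in the text */

/* Argmax over logits, skipping every id listed in `exclude`.
 * Returns -1 if all ids are excluded. */
int argmax_excluding(const float *logits, int n_vocab,
                     const int *exclude, int n_exclude)
{
    int best = -1;
    for (int id = 0; id < n_vocab; ++id) {
        int skip = 0;
        for (int j = 0; j < n_exclude; ++j)
            if (exclude[j] == id) { skip = 1; break; }
        if (skip) continue;
        if (best < 0 || logits[id] > logits[best]) best = id;
    }
    return best;
}

/* Greedy pick with the min-response guard: during the first
 * SPEC_MIN_RESP_N emitted tokens an EOS argmax is replaced by the best
 * non-EOS token; after that EOS is respected. */
int pick_token_with_eos_guard(const float *logits, int n_vocab,
                              int eos_id, int n_emitted)
{
    int top = argmax_excluding(logits, n_vocab, NULL, 0);
    if (top == eos_id && n_emitted < SPEC_MIN_RESP_N)
        return argmax_excluding(logits, n_vocab, &eos_id, 1);
    return top;
}
```

Because the masked argmax reuses logits that are already available, the bypass costs one extra scan of the vocabulary and no extra forward pass.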
5.5.2 Reproducing
```powershell
git checkout d57162d  # OTT speculative ready commit
.\build_host.ps1
.\repair_ott.ps1 -ModelPath models\smollm2-135m-instruct-q8_0.gguf
.\build_host\geodessical.exe `
    --model models\smollm2-135m-instruct-q8_0.gguf `
    --ott-full --ott-speculative --ott-spec-batch 4 --ott-spec-thresh 0.45 `
    --prompt "Write a short greeting." --max-tokens 32
```
Output ends with [SPEC] Done: N tokens (..., acceptance_rate=...)
and writes ott_readiness_report.json. A full GTC anchor for this
same model (coverage, batch resonance, compressed records) is in
docs/figures/gtc/GTC_RESULTS.md.
The frank table, with measurements deferred
We list the four composition cells and our prior expectation for each, with the numbers explicitly marked as predictions until the benchmark pass produces them.
| Composition | Mechanism | Prior expectation | Status |
|---|---|---|---|
| GP × speculative decoding | Drafter $T_D$ drops; $\alpha$ may drop too | Wash on consumer hardware (verifier-bound); positive on tier-asymmetric setups | Measured on 135M-Instruct: $\alpha=0.385$, $1.53\times$ end-to-end, see §5.5 |
| GP × AttnRes | Block-summary subspace narrows with $k$ | Wash at moderate $k$; small loss at aggressive $k$ | Implemented, not measured |
| GP × KV-cache projection | Long-context VRAM saving | Useful at $\geq$ 8k context; irrelevant at decode-only protocols of Paper 1 | Implemented, not measured at long context |
| Speculative × AttnRes (without GP) | Same drafter and verifier path | Same as full speculative; AttnRes is orthogonal to the rejection mechanism | Implemented, not measured |
v0.3 fills the first row in this table for the 135M-Instruct model (§5.5). Rows 2–4 remain unmeasured pending the EC2 sweep.
Three drafter modes shipped in the runtime but absent from earlier drafts
Earlier drafts of this paper described only the geodesic-projection drafter
of §2. The runtime in host/main.c ships three additional draft
modes that compose with that pipeline. They are documented here so that the
flag set on the binary matches the flag set described in this paper. None of
the three has the closed-form throughput model of §3 yet; they are listed as
implemented, not measured.
OneDecode (--one-decode)
OneDecode bakes a geodesic flow map once over a vocabulary slice (default
$V_{\text{bake}}=2048$ tokens, settable with
--one-decode-coverage N) and persists it to
ott_one_decode.bin. At decode time, a hit on the table returns
(token, confidence) in $O(1)$ and the model forward is skipped. The
intuition is the same as Paper 4 §2 (geodesic trajectory caching), but
here it is exposed as a runtime drafter rather than a research artefact.
On a miss, the runtime falls back to the geodesic drafter of §2 and then
the verifier.
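The hit-or-fall-back behaviour can be illustrated with a toy lookup. Everything in this C sketch is a hypothetical stand-in: the paper does not specify the layout of ott_one_decode.bin, so the table is assumed here to be a flat array keyed by the last token id, and the `min_conf` cut-off is an illustrative parameter rather than a documented flag.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t next_token;   /* baked proposal; -1 marks an empty slot */
    float   confidence;
} od_entry_t;             /* hypothetical entry layout */

typedef struct {
    const od_entry_t *entries;   /* one slot per token id in the baked slice */
    size_t            n_entries; /* size of the baked vocabulary slice */
} od_table_t;

/* O(1) lookup. Returns 1 on a hit and fills out_token / out_conf.
 * Returns 0 on a miss (key outside the slice, empty slot, or low
 * confidence), in which case the caller falls back to the next drafter. */
int od_lookup(const od_table_t *table, int32_t last_token, float min_conf,
              int32_t *out_token, float *out_conf)
{
    if (last_token < 0 || (size_t)last_token >= table->n_entries)
        return 0;
    const od_entry_t *e = &table->entries[last_token];
    if (e->next_token < 0 || e->confidence < min_conf)
        return 0;
    *out_token = e->next_token;
    *out_conf  = e->confidence;
    return 1;
}
```

The point of the sketch is the control flow: a hit costs one bounds check and one array read, and every miss path returns immediately so the fallback drafter can run.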
OTT-OD (--ott-od)
OTT-OD wires OneDecode in as the draft source for speculative
decoding. The OneDecode lookup proposes a token; the verifier decides. This
keeps the verifier (and therefore the acceptance distribution) identical to
standard speculative decoding while letting the drafter cost go to zero on a
table hit. --ott-od implies --one-decode so the
bake step always runs.
OTT-SWARM (--ott-swarm K)
OTT-SWARM fans out $K$ candidate tokens per draft slot from the OneDecode table (or, on a miss, from the geodesic drafter), and submits all $K$ to the verifier in a single batched forward. This is structurally similar to Medusa-style multi-head drafting, except the candidates come from a baked flow map rather than learned auxiliary heads.
A reader looking at the source can verify that all three modes are real and
composed from the same primitives:
host/main.c §args parser,
geodesic_ensure_one_decode,
and the swarm fan-out around main.c:1924. The bake/save/load
primitives live in runtime/nn/axiom_beta.c alongside the rest
of the geometry-cache code.
Why these are listed as design rather than result: the bake step is deterministic and the table format is stable, but we have no published acceptance-rate or end-to-end tok/s measurement for any of the three modes. They are part of the §8 status list (item 1).
What v0.3 has and what is still missing
Now landed (v0.3):
- First end-to-end measurement on a 135M instruct model: $\alpha=0.385$, 76.5 tok/s, geodesic_ready. §5.5.
- llm_topk_excluding + SPEC_MIN_RESP_N guard for instruct-greedy-EOS. §5.5.1.
- Reproducible OTT repair pipeline (repair_ott.ps1), readiness gate (ott_readiness_report.json), and geometry-cache consistency-equivalence (reused_geometry_cache implies $\text{consistency}=1$).
Still missing for v0.4:
- Acceptance-rate sweep on Llama-3.1-8B (drafter = GP-compressed; verifier = uncompressed) on tier-asymmetric hardware that fits both models. Gated on EC2 compute.
- End-to-end tok/s comparison: full decode vs full-spec vs GP-spec, all under the locked 30-second cooldown protocol of Paper 1.
- AttnRes × GP perplexity sweep.
- Long-context (≥ 8k tokens) KV-cache compression footprint and PPL.
- Functional --ott-perfect mode (transformer-exact rollout). The first attempt hung in the llm_rollout_exact_greedy retry path and was reverted; this is the realistic route to $\alpha \ge 0.9$ on the same model.
- Functional --ott-swarm-k (currently exits non-zero); when fixed, expected to push $\alpha$ into the 0.6--0.8 range.
- Per-prompt OD bake (currently OD is baked once on a generic anchor; baking per-prompt is expected to lift $\alpha$ towards 0.7--0.8).
- Citation pass to fixed bibliography numbers.
The v0.3 publication threshold is met: the implementation is real, the first measurement exists, and the failure modes that block higher acceptance are enumerated rather than hidden.
What this paper does not establish
The v0.3 anchor is a single-model, single-hardware measurement, and the composition claims in §6 mix design with measurement. Read the following before quoting numbers from this paper.
- Single-model anchor. The 38.5%/76.5 tok/s result is on SmolLM2-135M-Instruct only. The closed-form throughput model (§3) predicts higher acceptance on larger drafters, but those predictions are unmeasured. The Llama-3.1-8B drafter sweep is gated on compute and explicitly listed under "still missing for v0.4" (§8).
- Acceptance is not a quality claim. $\alpha=0.385$ is the geometric verifier-acceptance rate, not a downstream-task score. Where users care about MMLU / HumanEval / GSM8K, those have not been re-measured under the spec path; Paper 1 §6 carries the only PPL anchor in this stack.
- Composition table is mostly design. Of the four cells in the GP × spec × AttnRes × KV table (§6), only GP × spec is measured. AttnRes composition is a prototype with a documented negative result on simplex blending and a positive result on single-anchor Jacobi transport; KV-cache compression footprint past 8k tokens is unmeasured.
- Instruct-greedy-EOS fix is local. The llm_topk_excluding + SPEC_MIN_RESP_N guard (§5.5.1) was identified and patched on the 135M-Instruct model. Whether the same pathology surfaces on larger instruct models with the same template family is untested; the fix is conservative (it only changes behaviour when the verifier proposes EOS at a position guarded by the response-length floor), so the failure mode if it generalises is "spec path silently degrades to greedy", not a correctness regression.
- OTT-perfect and OTT-swarm-k are not yet runnable. The most credible route to $\alpha\ge 0.9$ on this model (transformer-exact rollout) hangs in the llm_rollout_exact_greedy retry path; the swarm-k path exits non-zero. Until those are fixed, the upper-bound acceptance claims in §3 remain analytic.
- Hardware envelope. All numbers are on a single RTX 4070 Laptop / Ryzen 9 7940HS / 32 GB box. Spec gains are sensitive to verifier batch size, KV layout, and decode-vs-prefill ratio; cross-hardware reproduction is open work.
The composition claim that this paper does make is narrow:
GP-compressed drafter on a verified path on a 135M-Instruct model
runs end-to-end at geodesic_ready with the cited
acceptance and tok/s. Everything past that is either marked open in
§8 or framed as a closed-form prediction.
Selected refs
1. Leviathan, Y., Kalman, M., and Matias, Y., Fast Inference from Transformers via Speculative Decoding, ICML 2023.
2. Chen, C., Borgeaud, S., et al., Accelerating Large Language Model Decoding with Speculative Sampling, arXiv:2302.01318, 2023.
3. Liu, H., Wang, X., et al., Residual Stream Analysis in Pre-Norm Transformers, NeurIPS 2024. Origin of the $\sqrt{L}$ growth observation.
4. Kimi Team, Block Attention Residuals, arXiv:2603.15031, 2026.
5. Cai, T. et al., Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, 2024.
6. Li, Y. et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, 2024.
7. Zhang, J. et al., Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, 2023.
8. HyperTensor Paper 1: Calibration-Free Low-Rank Attention Compression..., 2026. Source of the $T_D$ and PPL numbers used in §3.