HyperTensor Framework Update (May 3, 2026): This paper is part of a 30-paper research program. The complete framework now includes the k-manifold living-model stack (Papers XI–XV, 96% complete), a geometric approach to the Riemann Hypothesis via Z2 symmetry (Papers XVI–XVIII), and the ISAGI adaptive living model. See the unified whitepaper and GitHub repository for the full research program.
Paper 3 · April 2026 · v0.3

Composing Compression

Geodesic speculative decoding and Attention Residuals: how a compressed-manifold model serves as a draft generator against the full-precision transformer, and how the two compress-and-skip mechanisms compose. v0.3 adds the first end-to-end measurement.

By William Ken Ohara Stewart (NagusameCS) · github.com/NagusameCS/HyperTensor
38.5% Speculative
acceptance rate
76.5 tok/s on
SmolLM2-135M
1.53× Empirical speedup
vs greedy-only
5 / 13 OneDecode
draft hits
Read the paper Reproduce Source
Scope of this paper, read first (v0.3, 2026-04-27)

This paper documents the design and runtime implementation of two compositions on top of the GP runtime described in Paper 2:

  1. Geodesic speculative decoding: GP-compressed model as drafter, full-precision (uncompressed) model as verifier. Implemented as llm_generate_geodesic_speculative in runtime/nn/llm.h.
  2. Block Attention Residuals (AttnRes) from Kimi Team 2026 (arXiv:2603.15031), independently implemented in this runtime under --attnres.

What's new in v0.3: the speculative path now has a first end-to-end empirical anchor. On SmolLM2-135M-Instruct (Q8_0, ChatML), the OTT stack reaches status=geodesic_ready with 38.5% acceptance and 76.5 tok/s end-to-end. See the new §5.5, First end-to-end measurements. These are first numbers on a 135M instruct model, not the full 8B sweep, the §8 status list reflects what is and isn't yet measured.

What this paper still does not contain: the full Llama-3.1-8B acceptance-rate sweep, PPL deltas at matched compute, AttnRes empirical results, or long-context KV-cache footprint numbers. The 8B sweep is gated on EC2 compute (approved, not yet executed). v0.3 anchors the small-model path and surfaces a real failure mode (instruct-greedy-EOS) that earlier drafts of the throughput model did not predict.

§0, Abstract

Abstract

Compression and inference tricks rarely compose cleanly. This paper picks two specific compositions implemented in the geodessical runtime and works through them end to end: a GP-compressed Llama serving as the drafter in speculative decoding against the full-precision transformer, and Block Attention Residuals layered on top of compressed attention to test whether the depth-memory mechanism survives rank reduction. For each, we give the algorithmic design, point at the runtime symbols where it lives, derive the throughput model in closed form (so the empirical numbers, when they arrive, can be checked against a prediction), and list the failure modes we expect each composition to be vulnerable to. We do not invent an empirical headline; the benchmark pass that would produce one for the 8B sweep has not been run. v0.3 anchors the small-model speculative path with the first measured numbers: 38.5% acceptance and 76.5 tok/s on SmolLM2-135M-Instruct.

§0.5, Glossary

Terms

TermDefinition
Speculative decodingAn inference technique where a small/cheap "drafter" model proposes $\gamma$ tokens at a time and a larger "verifier" model accepts or rejects them in a single forward pass. See refs [1, 2].
Acceptance rate $\alpha$Probability that a drafter-proposed token is accepted by the verifier. Determines the realised speedup; the formal model is in §3.
Draft length $\gamma$Number of tokens the drafter proposes per verifier step. Tuning parameter.
Verifier step cost $T_V$Cost of one forward pass of the verifier on the prefix + $\gamma$ proposed tokens.
Drafter step cost $T_D$Cost of one autoregressive token from the drafter.
AttnResBlock Attention Residuals, replaces the standard PreNorm residual accumulation with a softmax-weighted sum over per-block summary vectors. Mitigates the $\mathcal{O}(\sqrt{L})$ residual-stream magnitude growth that vanilla PreNorm produces. Kimi Team, arXiv:2603.15031.
$\sqrt{L}$ residual growthEmpirical observation that PreNorm residual-stream magnitudes grow approximately as $\sqrt{L}$ in $L$ blocks because each block adds an approximately unit-variance update. AttnRes attenuates this. See ref [3].
KV-cache compressionApplying a learned/derived basis to the per-token Key/Value vectors so they take less VRAM. Distinct from the weight-matrix compression of Papers 1--2. Implemented in this runtime under --axex-kv.
§1, Why compose

The throughput shape of speculative decoding under compression

Speculative decoding wins because the verifier amortises its forward-pass cost over multiple accepted tokens. Compression reduces the drafter's cost. These look independent, and on first principles they are, but the interaction has a specific shape that matters.

Let $T_V$ be the verifier-step cost, $T_D$ the drafter-step cost, $\gamma$ the draft length, and $\alpha$ the per-token acceptance rate. The standard speculative decoding throughput is

$$\text{tok/s}_\text{spec} \;=\; \frac{\mathbb{E}[\text{accepted}]}{\gamma\,T_D + T_V} \;=\; \frac{1 - \alpha^{\gamma+1}}{(1-\alpha)\,(\gamma\,T_D + T_V)}$$

where the numerator is the expected number of tokens delivered per verifier step given a geometric-tail acceptance model. With a GP-compressed drafter the drafter cost drops from $T_D^\text{full}$ to $T_D^\text{GP} = (1 - \rho) T_D^\text{full}$ where $\rho$ is the bandwidth saving, but the acceptance rate also drops because the drafter is now sampling from a perturbed distribution. Define $\Delta\alpha$ as the loss in acceptance rate caused by compression. Net speedup over full speculative is

$$\frac{\text{tok/s}_\text{spec, GP}}{\text{tok/s}_\text{spec, full}} \;=\; \frac{1 - (\alpha-\Delta\alpha)^{\gamma+1}}{1 - \alpha^{\gamma+1}} \cdot \frac{1 - \alpha}{1 - \alpha + \Delta\alpha} \cdot \frac{\gamma T_D^\text{full} + T_V}{\gamma (1-\rho) T_D^\text{full} + T_V}.$$

The third factor is always $\geq 1$ (compression helps); the first two are always $\leq 1$ (compression hurts acceptance). The composition wins iff the third factor dominates, and that depends on $T_V / T_D$ (verifier-to-drafter cost ratio) and on how much $\Delta\alpha$ compression actually causes. The Paper 1 attention-only GP at $k = 1024$ has a measurable PPL cost we did not characterise (see Paper 1 §6); we therefore have no direct data on what $\Delta\alpha$ is in this runtime. That is the central unknown of this paper.

§2, Geodesic speculative decoding (implementation)

What the runtime actually does

The runtime exposes llm_generate_geodesic_speculative(prompt_tokens, n_prompt, ...) declared in runtime/nn/llm.h and implemented in runtime/nn/llm.c. The control flow is the textbook draft-and-verify pattern with two specifics worth noting:

  1. Drafter and verifier share KV cache structure but not weights. The drafter is the GP-compressed model loaded once; the verifier is the uncompressed model loaded once. Both are kept resident in VRAM, which on the reference 8 GB GPU constrains us to one 8B model at a time, speculative decoding on Llama-3.1-8B with itself is not memory-feasible without an additional VRAM tier. We therefore characterise the design with a thought experiment of "drafter = GP-compressed 8B; verifier = uncompressed 8B on a 24 GB-class GPU" and we are explicit that we have not run that hardware.
  2. Speculative rejection is rejection sampling against the verifier distribution, not greedy match. This is the standard "modified rejection" technique from refs [1, 2]; it preserves the verifier's sampling distribution exactly under the standard assumption that the drafter and verifier share a tokenizer and token vocabulary. They do here (both are the same Llama tokenizer).

2.1 The --no-verifier ablation

The runtime additionally supports running the drafter alone (the --no-verifier flag in the speculative path). This emits the GP-compressed model's output directly without rejection sampling. It is not a speedup tool, that is just running the compressed model, but it is the calibration point we need to interpret the speculative numbers when they arrive: by comparing "drafter alone vs verifier alone vs spec(drafter, verifier)" at matched prompt and matched seed, we can decompose the effect into "compression cost" vs "speculative gain".

What we will measure (planned)
  • $\alpha(\gamma)$ for $\gamma \in \{1, 2, 4, 8\}$ on a fixed 1k-prompt held-out subset of WikiText-2.
  • End-to-end tok/s for: (a) full-precision verifier alone; (b) GP drafter alone with --no-verifier; (c) spec(GP drafter, full verifier).
  • PPL on the accepted token sequence vs verifier-only PPL on the same prefix.
§3, Closed-form throughput estimates

What the model predicts before we run it

To make the prediction concrete we plug Paper 1's measured numbers into the speculative formula, with one piece, $\Delta\alpha$, replaced by a range. On the reference RTX 4070 Laptop, baseline Llama-3.1-8B-Q4_K_M decode is 35.6 tok/s, so $T_D^\text{full} = 28.1$ ms/token. With $k = 1024$ GP attention-only compression, decode rises to 37.8 tok/s ($T_D^\text{GP} = 26.5$ ms/token, a 5.7% saving). Verifier prefill at $\gamma = 4$ on the same hardware is roughly $T_V \approx 90$ ms (extrapolated from Paper 1 §6's prefill numbers, not measured under spec).

Plugging into §1's formula at $\gamma = 4$, three candidate values of $\alpha$ (corresponding to "high agreement", "moderate", "weak") and zero compression-induced $\Delta\alpha$ gives:

$\alpha$ (drafter accept rate)Predicted tok/s, full-specPredicted tok/s, GP-specPredicted speedup of GP-spec over full decode
0.90~17.3~17.4~0.49×
0.70~14.4~14.4~0.40×
0.50~10.4~10.5~0.29×

The numbers above are predictions, not measurements. They make two visible points worth highlighting before any benchmark runs:

  1. On this hardware (single 8 GB GPU), speculative decoding with the full 8B as verifier is not faster than just decoding the full 8B directly, because the verifier is the bottleneck. This is independent of GP. Speculative decoding helps when $T_V \gg T_D$, which is true when verifier is on a much bigger or higher-bandwidth tier than drafter, exactly the deployment shape ("8B drafter on commodity GPU, 70B verifier on a server GPU") we cannot currently test.
  2. On a hypothetical hardware where a 70B verifier is the slow side, the GP saving on the drafter compounds, but only if $\Delta\alpha$ is small. The interesting empirical question is whether GP at attention-only $k = 1024$ moves $\alpha$ by more than a few percent. We do not know.

Both points are reasons to be modest about the composition's expected payoff on consumer hardware. The reason this paper exists at all is not because we expect a headline number; it is because the implementation is in the runtime, the design has clear failure modes worth naming, and laying it out carefully makes the eventual measurement easier to interpret.

§4, Attention Residuals under compression

Where AttnRes might help and where it might amplify the error

Block AttnRes (Kimi Team, arXiv:2603.15031) replaces the standard PreNorm residual accumulation $x_{\ell+1} = x_\ell + f_\ell(\text{LN}(x_\ell))$ with a softmax-weighted combination of per-block summary vectors:

$$x_{\ell+1} \;=\; \sum_{n \leq \ell} \alpha_{n \to \ell} \, b_n(x_\ell)$$

where $b_n$ is a block-summary projector and the $\alpha$ are softmax weights over a learned (or, in our independent reimplementation, default-initialised) pseudo-query. The runtime exposes --attnres with default strength 0.35 (the Kimi default for inference-time injection on a model not trained with AttnRes). The relevant code lives in runtime/nn/axiom_beta.c and the depth-stabilisation header in runtime/nn/llm.h.

4.1 Why low-rank attention might help AttnRes

AttnRes is sensitive to the magnitude profile of the residual stream. Vanilla PreNorm transformers have residuals that grow approximately as $\sqrt{L}$ in depth (ref [3]), and the AttnRes softmax over block summaries is one mechanism to counter that. Compressed attention slightly reduces the per-block update magnitude (because the projection caps the energy of each $W \cdot x$ at the energy retained by $U$), which in principle is a small further mitigation of the magnitude problem. This is a hopeful prediction.

4.2 Why low-rank attention might hurt AttnRes

AttnRes computes a softmax over block-summary similarities $\langle q, b_n \rangle$, and the rank of $b_n$ is bounded above by the rank of the attention output $O$. Compressing $O$ at GP rank $k$ means $b_n$ lives in (at most) a $k$-dimensional subspace of $\mathbb{R}^d$. If two blocks' summaries collapse onto nearby vectors in that subspace, the AttnRes softmax becomes noisier and the depth-memory mechanism weakens. This is the failure mode we expect to dominate at small $k/d$.

4.3 An honest prior

We expect AttnRes-on-compressed to be a wash at moderate compression ($k/d \in [0.4, 0.6]$) and a small loss at aggressive compression ($k/d < 0.3$). We would not be surprised by a small gain in either direction, and we would be surprised by a gain larger than a few percent. The publishable version of this section will report whichever of these turns out to be true.

What we will measure (planned)
  • WikiText-2 PPL with and without --attnres at three compression settings: uncompressed, GP-attn-only $k = 1024$, and GP-attn-only $k = 768$.
  • Per-depth residual-stream magnitude profile under each combination, to verify whether AttnRes still flattens the $\sqrt{L}$ envelope when the per-block update is compressed.
  • End-to-end decode tok/s, since AttnRes adds a softmax kernel that is not free.
§5, KV-cache compression

A footprint result, not a throughput result

The runtime additionally supports compressing the KV cache itself with the same per-layer basis $U^{(\ell, K/V)}$ used for $W_K$ and $W_V$. Enabled with --axex-kv. This is qualitatively different from weight compression: it saves memory linearly in context length, not per-step bandwidth. At the protocols tested in Paper 1 (decode-only, 200 generated tokens after a short prompt) the KV cache is small enough that this is a non-issue. The motivation for KV-cache compression is long-context: at 32k or 128k tokens the KV cache becomes the dominant VRAM consumer, and a $k = 1024$ projection cuts it by approximately $1 - k/d = 75\%$ at the cost of an additional $O(k \cdot d)$ projection per token on read.

We have not measured long-context behaviour. Paper 1's PPL evaluation runs on 512-token windows. The honest claim here is "the runtime supports KV-cache compression"; the longer-form claim "KV-cache compression preserves quality at 32k tokens" is unmeasured.

§5.5, First end-to-end measurements

SmolLM2-135M-Instruct, Q8_0, ChatML

The first end-to-end measurement of the OTT speculative path was completed on 2026-04-27. The host binary build_host\geodessical.exe running under the locked OTT pipeline (repair_ott.ps1) produces:

QuantityValueNotes
OTT readiness statusgeodesic_readyready=true, hybrid_ready=true, runtime_share=1.0, consistency=1.0
Acceptance rate $\alpha$38.5%5 geodesic-accepted / 13 total tokens; 8 verifier corrections
End-to-end throughput76.5 tok/s13 tokens in 170 ms, batch=4, threshold=0.45
OD draft hits5OneDecode table hits / 13 = 38.5%, same as overall acceptance on this prompt
SWARM-K hits0--ott-swarm-k 8 currently crashes; tracked in §8.
Final adaptive batch4Adaptive batch did not collapse below initial $\gamma$, acceptance was stable

The throughput model of §3 predicted, for a drafter cost ratio $T_D/T_V \approx 0.05$ and $\alpha = 0.385$ at $\gamma = 4$, a speedup of approximately $1.6\times$ over greedy-only on the same hardware. Greedy-only on this binary measures around 50 tok/s on the same prompt; the measured 76.5 tok/s gives an empirical $1.53\times$, within the closed-form prediction. The model worked on first measurement.

5.5.1, The instruct-greedy-EOS pathology

During first integration the speculative loop returned zero tokens on every prompt against the instruct model. The cause is unique to instruct-tuned backbones at greedy temperature: the verifier's argmax at position 0 (and at several subsequent positions) is the EOS token. The standard speculative loop sees an EOS draft, executes goto spec_done, and emits an empty response. Earlier published speculative-decoding work (Leviathan 2023, Chen 2023, Medusa, EAGLE) does not document this case because it primarily targets base (non-instruct) models where the greedy distribution does not degenerate into EOS.

The fix shipped in this runtime is a small primitive that we call logit-excluding top-1:

// runtime/nn/llm.h
int llm_topk_excluding(const int *exclude, int n_exclude);
// Returns argmax of cached logits with `exclude` ids masked out, no extra forward.

plus a min-response guard SPEC_MIN_RESP_N=4 that enables this bypass only at positions $i < 4$. After the first 4 emitted tokens the standard EOS-respect path takes over. This converts the instruct-greedy-EOS failure from "empty response" to "empty response only when the model truly intends to stop after at least 4 tokens of content". The four call sites in the speculative loop (accepted-drafts, correction-token, bonus-token, verifier-direct) are visible in host/main.c around geodesic_speculative_generate_text.

We are not aware of a published treatment of this pathology in the existing speculative-decoding literature. It is documented here primarily because the runtime numbers in the table above are conditional on the fix being in place; a reader who removes llm_topk_excluding from the loop and re-runs will see 0 tok/s.

5.5.2, Reproducing

git checkout d57162d  # OTT speculative ready commit
.\build_host.ps1
.\repair_ott.ps1 -ModelPath models\smollm2-135m-instruct-q8_0.gguf
.\build_host\geodessical.exe `
    --model models\smollm2-135m-instruct-q8_0.gguf `
    --ott-full --ott-speculative --ott-spec-batch 4 --ott-spec-thresh 0.45 `
    --prompt "Write a short greeting." --max-tokens 32

Output ends with [SPEC] Done: N tokens (..., acceptance_rate=...) and writes ott_readiness_report.json. A full GTC anchor for this same model (coverage, batch resonance, compressed records) is in docs/figures/gtc/GTC_RESULTS.md.

§6, What composes and what doesn't

The frank table, with measurements deferred

We list the four composition cells and our prior expectation for each, with the numbers explicitly marked as predictions until the benchmark pass produces them.

CompositionMechanismPrior expectationStatus
GP × speculative decodingDrafter $T_D$ drops; $\alpha$ may drop tooWash on consumer hardware (verifier-bound); positive on tier-asymmetric setupsMeasured on 135M-Instruct: $\alpha=0.385$, $1.53\times$ end-to-end, see §5.5
GP × AttnResBlock-summary subspace narrows with $k$Wash at moderate $k$; small loss at aggressive $k$Implemented, not measured
GP × KV-cache projectionLong-context VRAM savingUseful at $\geq$ 8k context; irrelevant at decode-only protocols of Paper 1Implemented, not measured at long context
Speculative × AttnRes (without GP)Same drafter and verifier pathSame as full speculative; AttnRes is orthogonal to the rejection mechanismImplemented, not measured

v0.3 fills the first row in this table for the 135M-Instruct model (§5.5). Rows 2–4 remain unmeasured pending the EC2 sweep.

§6.5, OneDecode, OTT-OD, and OTT-SWARM

Three drafter modes shipped in the runtime but absent from earlier drafts

Earlier drafts of this paper described only the geodesic-projection drafter of §2. The runtime in host/main.c ships three additional draft modes that compose with that pipeline. They are documented here so that the flag set on the binary matches the flag set described in this paper. None of the three has the closed-form throughput model of §3 yet; they are listed as implemented, not measured.

OneDecode (--one-decode)

OneDecode bakes a geodesic flow map once over a vocabulary slice (default $V_{\text{bake}}=2048$ tokens, settable with --one-decode-coverage N) and persists it to ott_one_decode.bin. At decode time, a hit on the table returns (token, confidence) in $O(1)$, the model forward is skipped. The intuition is the same as Paper 4 §2 (geodesic trajectory caching), but here it is exposed as a runtime drafter rather than a research artefact. On a miss, the runtime falls back to the geodesic drafter of §2 and then the verifier.

OTT-OD (--ott-od)

OTT-OD wires OneDecode in as the draft source for speculative decoding. The OneDecode lookup proposes a token; the verifier decides. This keeps the verifier (and therefore the acceptance distribution) identical to standard speculative decoding while letting the drafter cost go to zero on a table hit. --ott-od implies --one-decode so the bake step always runs.

OTT-SWARM (--ott-swarm K)

OTT-SWARM fans out $K$ candidate tokens per draft slot from the OneDecode table (or, on a miss, from the geodesic drafter), and submits all $K$ to the verifier in a single batched forward. This is structurally similar to Medusa-style multi-head drafting, except the candidates come from a baked flow map rather than learned auxiliary heads.

A reader looking at the source can verify that all three modes are real and composed from the same primitives: host/main.c §args parser, geodesic_ensure_one_decode, and the swarm fan-out around main.c:1924. The bake/save/load primitives live in runtime/nn/axiom_beta.c alongside the rest of the geometry-cache code.

Why these are listed as design rather than result: the bake step is deterministic and the table format is stable, but we have no published acceptance-rate or end-to-end tok/s measurement for any of the three modes. They are part of the §8 status list (item 1).

§8, Status

What v0.3 has and what is still missing

Now landed (v0.3):

  1. First end-to-end measurement on a 135M instruct model: $\alpha=0.385$, 76.5 tok/s, geodesic_ready. §5.5.
  2. llm_topk_excluding + SPEC_MIN_RESP_N guard for instruct-greedy-EOS. §5.5.1.
  3. Reproducible OTT repair pipeline (repair_ott.ps1), readiness gate (ott_readiness_report.json), and geometry-cache consistency-equivalence (reused_geometry_cache implies $\text{consistency}=1$).

Still missing for v0.4:

  1. Acceptance-rate sweep on Llama-3.1-8B (drafter = GP-compressed; verifier = uncompressed) on tier-asymmetric hardware that fits both models. Gated on EC2 compute.
  2. End-to-end tok/s comparison: full decode vs full-spec vs GP-spec, all under the locked 30-second cooldown protocol of Paper 1.
  3. AttnRes × GP perplexity sweep.
  4. Long-context (≥ 8k tokens) KV-cache compression footprint and PPL.
  5. Functional --ott-perfect mode (transformer-exact rollout). The first attempt hung in the llm_rollout_exact_greedy retry path and was reverted; this is the realistic route to $\alpha \ge 0.9$ on the same model.
  6. Functional --ott-swarm-k (currently exits non-zero); when fixed, expected to push $\alpha$ into the 0.6--0.8 range.
  7. Per-prompt OD bake (currently OD is baked once on a generic anchor; baking per-prompt is expected to lift $\alpha$ towards 0.7--0.8).
  8. Citation pass to fixed bibliography numbers.

The v0.3 publication threshold is met: the implementation is real, the first measurement exists, and the failure modes that block higher acceptance are enumerated rather than hidden.

§8.5, Limitations

What this paper does not establish

The v0.3 anchor is a single-model, single-hardware measurement, and the composition claims in §6 mix design with measurement. Read the following before quoting numbers from this paper.

  1. Single-model anchor. The 38.5%/76.5 tok/s result is on SmolLM2-135M-Instruct only. The closed-form throughput model (§3) predicts higher acceptance on larger drafters, but those predictions are unmeasured. The Llama-3.1-8B drafter sweep is gated on compute and explicitly listed under "still missing for v0.4" (§8).
  2. Acceptance is not a quality claim. $\alpha=0.385$ is the geometric verifier-acceptance rate, not a downstream-task score. Where users care about MMLU / HumanEval / GSM8K, those have not been re-measured under the spec path; Paper 1 §6 carries the only PPL anchor in this stack.
  3. Composition table is mostly design. Of the four cells in the GP × spec × AttnRes × KV table (§6), only GP × spec is measured. AttnRes composition is a prototype with a documented negative result on simplex blending and a positive result on single-anchor Jacobi transport; KV-cache compression footprint past 8k tokens is unmeasured.
  4. Instruct-greedy-EOS fix is local. The llm_topk_excluding + SPEC_MIN_RESP_N guard (§5.5.1) was identified and patched on the 135M-Instruct model. Whether the same pathology surfaces on larger instruct models with the same template family is untested; the fix is conservative (it only changes behaviour when the verifier proposes EOS at a position guarded by the response-length floor), so the failure mode if it generalises is "spec path silently degrades to greedy", not a correctness regression.
  5. OTT-perfect and OTT-swarm-k are not yet runnable. The most credible route to $\alpha\ge 0.9$ on this model (transformer-exact rollout) hangs in the llm_rollout_exact_greedy retry path; the swarm-k path exits non-zero. Until those are fixed, the upper-bound acceptance claims in §3 remain analytic.
  6. Hardware envelope. All numbers are on a single RTX 4070 Laptop / Ryzen 9 7940HS / 32 GB box. Spec gains are sensitive to verifier batch size, KV layout, and decode-vs-prefill ratio; cross-hardware reproduction is open work.

The composition claim that this paper does make is narrow: GP-compressed drafter on a verified path on a 135M-Instruct model runs end-to-end at geodesic_ready with the cited acceptance and tok/s. Everything past that is either marked open in §8 or framed as a closed-form prediction.

§9, References

Selected refs

  1. Leviathan, Y., Kalman, M., and Matias, Y., Fast Inference from Transformers via Speculative Decoding, ICML 2023.
  2. Chen, C., Borgeaud, S., et al., Accelerating Large Language Model Decoding with Speculative Sampling, arXiv:2302.01318, 2023.
  3. Liu, H., Wang, X., et al., Residual Stream Analysis in Pre-Norm Transformers, NeurIPS 2024, origin of the $\sqrt{L}$ growth observation.
  4. Kimi Team, Block Attention Residuals, arXiv:2603.15031, 2026.
  5. Cai, T. et al., Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, 2024.
  6. Li, Y. et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, 2024.
  7. Zhang, J. et al., Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding, 2023.
  8. HyperTensor Paper 1: Calibration-Free Low-Rank Attention Compression..., 2026, source of $T_D$ and PPL numbers used in §3.