Project nomenclature. HyperTensor is the project umbrella; Geodesic Projection (GP) is the inference-time pipeline (companion paper); Geodesic Runtime Compression (GRC) is its attention-only subset; and OTT/GTC (Organic Training Theory / Geodesic Trajectory Caching), the subject of this paper, is the longer-term theoretical framework that views the GP projection bases as charts on a Riemannian manifold. GP is the engineering layer for which this paper provides the geometric justification.
Introduction
Modern transformer inference is dominated, on bandwidth-bound hardware, by the shape of the residual stream rather than by raw arithmetic (Vaswani et al. 2017; Grattafiori et al. 2024; Williams et al. 2009). Empirical work on linear representations and intrinsic dimensionality consistently reports that the activation cloud of a trained model occupies a low-dimensional submanifold of the embedding space; estimates put intrinsic dimension at \(k\!\approx\!30\)--\(50\) even when ambient \(d\in\{4096,8192\}\) (Elhage et al. 2022; Nakkiran et al. 2021).
Cross-architecture \(k_{\mathrm{int}}\) invariance.
Running our manifold-extraction pipeline on three open transformers of widely different widths makes the same point quantitatively:
| model | ambient \(d\) | \(k_{\mathrm{int}}\) at \(95\%\) var. | \(k_{\mathrm{int}}/d\) |
|---|---|---|---|
| SmolLM2-135M | 576 | 17 | \(0.030\) |
| Gemma-4-E2B | 1536 | 25 | \(0.016\) |
| Phi-3.5-mini | 3072 | 11 | \(0.0036\) |
A \(5.3\times\) increase in \(d\) buys a roughly \(8\times\) drop in \(k_{\mathrm{int}}/d\). The intrinsic dimension does not scale with the ambient width, which is the empirical premise on which both OTT (this paper) and GP/GRC (companion papers) operate.
This paper takes that observation seriously. Under Organic Training Theory (OTT), we model the trained latent space as a Riemannian manifold \(\mathcal{M}_\theta\) with the Fisher-information metric and read inference as approximate geodesic flow on \(\mathcal{M}_\theta\). Geodesic Trajectory Caching (GTC) operationalises that view: completed geodesics, with their Magnus-3 Jacobi propagator, are stored in a library and reused by a single linear correction.
The companion runtime paper (HyperTensor Project 2025b) measures one OTT configuration on a C runtime; the GP compression paper (HyperTensor Project 2025a) measures the static side of the same manifold. The present paper merges what was previously two web articles (Paper 4 theory; Paper 5 measurement-anchor): Part I formalises OTT/GTC, Part II reports the measurement anchor across three open-weight models.
Naming.
An earlier draft used "GRC" for Geodesic Resonance Caching. Paper A in this series (Geodesic Runtime Compression) uses "GRC" for the calibration-free attention-compression mechanism, which is unrelated and already implemented. To remove the collision, the trajectory-library idea is named GTC throughout this paper.
Contributions.
A clean statement of OTT and GTC with assumption-explicit theorem templates (Sec. 5), separating universal claims from deployment-scoped ones.
A measured cache-coverage curve on real LM activation clouds at three model scales, demonstrating empirical scale-invariance within \(\pm 0.5\%\) at the 25%-fraction cache size (Sec. 8).
A batch-Jacobi resonance benchmark showing \(97\times\) speedup at \(B=10\) with reconstruction error at the float64 roundoff floor (Sec. 9).
A compressed record store at 5.96 KB/record with exact rank-5 propagator truncation and a 30.9 μs/query two-stage lookup (Sec. 10).
Two small primitives needed to make the C-runtime path actually run on instruct-tuned drafters: logit-excluding top-1 with min-response guard and the geometry-cache consistency-equivalence rule (Sec. 11).
Part I, Theory
Organic Training Theory
Let \(\theta\in\mathbb{R}^P\) denote the trained weights of a transformer and let \(f_\theta\) be its forward map up to a chosen layer. Define the activation manifold \[\begin{equation} \mathcal{M}_\theta = \{\,x\in\mathbb{R}^d : x = f_\theta(\text{tokens}) \text{ for some input}\,\}, \end{equation}\] equipped with a Fisher-information metric \(g\) approximated locally from the sample covariance of \(\nabla_\theta\log p_\theta\) pulled back to activation space. Under regularity assumptions made explicit in Sec. 5, \(\mathcal{M}_\theta\) admits a Levi-Civita connection \(\Gamma^k_{ij}(x)\) and inference along a geodesic \[\begin{equation} \ddot x^k(\lambda) + \Gamma^k_{ij}(x)\,\dot x^i(\lambda)\,\dot x^j(\lambda) = 0 \label{eq:geodesic} \end{equation}\] has cost \(\mathcal{O}(nk^2)\) in intrinsic dimension \(k\), against \(\mathcal{O}(n^2 dL)\) for the standard transformer pass. The empirical gap between \(k\) and \(d\) (estimates of 30--50 vs. 4096--8192) is the source of the asymptotic improvement.
Fisher metric and the heat-kernel bridge
A natural-gradient interpretation of training (Golub and Van Loan 1996; Nakkiran et al. 2021) suggests that the metric \(g\) on \(\mathcal{M}_\theta\) is the Fisher information; under additional smoothness assumptions one obtains a Laplace--Beltrami diffusion form whose heat kernel \(G_t(x,y)\) satisfies Varadhan's asymptotic identity \[\begin{equation} d_g(x,y)^2 \;=\; \lim_{t\to 0}\,-4t\log G_t(x,y),\qquad \dot\gamma(0)\;\propto\;\nabla_x\bigl(-\log\mathrm{Attn}(x,y)\bigr). \label{eq:varadhan} \end{equation}\] Eq. [eq:varadhan] should be read as a model-family assumption (the trained attention kernel is the heat kernel at scale \(t\)), not as a universally proved theorem. Where it holds, attention gradients become a geodesic-direction oracle and OTT acquires a constructive starting velocity \(v_0\) for Eq. [eq:geodesic].
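As a sanity check on the limit in Eq. [eq:varadhan], the identity can be evaluated in the one case where the heat kernel is known in closed form: flat \(\mathbb{R}^n\), where \(G_t(x,y)=(4\pi t)^{-n/2}\exp(-|x-y|^2/4t)\). The sketch below is illustrative only; the function names are hypothetical and nothing here involves an attention kernel.

```python
import math

def log_heat_kernel(dist, t, n):
    """log G_t for the heat equation du/dt = Laplacian(u) on flat R^n.

    Worked in log space so that very small t does not underflow.
    """
    return -0.5 * n * math.log(4.0 * math.pi * t) - dist ** 2 / (4.0 * t)

def varadhan_estimate(dist, t, n):
    """-4 t log G_t, which tends to dist**2 as t -> 0."""
    return -4.0 * t * log_heat_kernel(dist, t, n)

true_sq = 1.5 ** 2
ts = (1e-1, 1e-2, 1e-3)
estimates = [varadhan_estimate(1.5, t, n=3) for t in ts]
errors = [abs(e - true_sq) for e in estimates]

for t, e in zip(ts, estimates):
    print(f"t={t:g}  -4t log G_t = {e:.4f}  (d^2 = {true_sq})")
```

The residual is \(2nt\log(4\pi t)\), which vanishes as \(t\to 0\); on a curved manifold the same limit holds but the finite-\(t\) correction also carries curvature terms.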
From the Fisher metric to the implemented Jacobi propagator
The runtime does not work directly with the Fisher metric tensor; it works with its Magnus-3 Jacobi propagator \(\Phi(\lambda)\), the linear map that advances a deviation vector along a geodesic. The chain that links the two is explicit and worth writing in one place. Given the Fisher metric \(g_{ij}(\theta)=\mathbb{E}_{x\sim p_\theta}\!\bigl[\partial_i\log p_\theta(x)\, \partial_j\log p_\theta(x)\bigr]\), the Levi-Civita connection symbols are \[\begin{equation} \Gamma^k_{ij}\;=\;\tfrac{1}{2}g^{kl}\!\left(\partial_i g_{jl}+\partial_j g_{il}-\partial_l g_{ij}\right), \label{eq:christoffel} \end{equation}\] and the Riemann curvature tensor along a geodesic \(\gamma(\lambda)\) is \(R^k{}_{lij}=\partial_i\Gamma^k_{jl}-\partial_j\Gamma^k_{il} +\Gamma^k_{im}\Gamma^m_{jl}-\Gamma^k_{jm}\Gamma^m_{il}\). A deviation field \(J(\lambda)\) between neighbouring geodesics satisfies the Jacobi equation \[\begin{equation} \frac{D^2 J^k}{d\lambda^2}\;+\;R^k{}_{ijl}\,\dot\gamma^i\,J^j\,\dot\gamma^l\;=\;0. \label{eq:jacobi} \end{equation}\] Writing \(\mathbf{J}=(J,\dot J)\in\mathbb{R}^{2k}\) and assembling the curvature-and-connection block into the \(2k\times2k\) generator \(A(\lambda)\), Eq. [eq:jacobi] becomes \(\dot{\mathbf{J}}=A(\lambda)\mathbf{J}\), whose flow is the Magnus expansion (Magnus 1954) \[\begin{equation} \Phi(\lambda)\;=\;\exp\!\Bigl(\Omega_1+\Omega_2+\Omega_3+\cdots\Bigr),\qquad \Omega_1=\!\int_0^\lambda\!A(s)\,ds,\;\; \Omega_2=\tfrac{1}{2}\!\int_0^\lambda\!\!\int_0^{s_1}\![A(s_1),A(s_2)]\,ds_2\,ds_1, \label{eq:magnus} \end{equation}\] and \(\Omega_3\) is the corresponding nested-commutator integral. The runtime truncates after \(\Omega_3\) (hence "Magnus-3") and integrates \(A\) along waypoints sampled from the geodesic.
In one line: the implemented map is \[\begin{equation} \boxed{\;\;g_{ij}(\theta)\;\xrightarrow{\eqref{eq:christoffel}}\;\Gamma^k_{ij}\;\xrightarrow{\eqref{eq:jacobi}}\;A(\lambda)\;\xrightarrow{\eqref{eq:magnus},\,\text{trunc.\,3}}\;\Phi(\lambda).\;\;} \label{eq:fisher-to-phi} \end{equation}\]
Eq. [eq:gtc-correction] is then a first-order Taylor expansion of the flow with \(\Phi(\lambda)\) in the linear coefficient slot, which is exactly what the cache stores. The empirical anchor benchmarks measure the inverse property: that the cached \(\Phi(\lambda)\) correctly predicts deviation vectors on real LM activation clouds, validating the truncation.
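The last arrow of the chain can be illustrated numerically. The following is a minimal sketch, not the runtime's implementation: it truncates the Magnus series after \(\Omega_2\) (the runtime keeps \(\Omega_3\)), uses a midpoint Riemann sum for the integrals, and exponentiates by a truncated Taylor series. The helper names `expm_taylor`, `magnus2_propagator` and the callable `A_of` are hypothetical. On a unit sphere the scalar Jacobi equation is \(J''+J=0\), so \(A\) is constant, the commutators vanish, and \(\Phi(\lambda)\) should be a plane rotation.

```python
import numpy as np

def expm_taylor(M, terms=30):
    """Matrix exponential by truncated Taylor series (fine for small norms)."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for n in range(1, terms):
        term = term @ M / n
        out = out + term
    return out

def magnus2_propagator(A_of, lam, steps=100):
    """Phi(lam) ~ exp(Omega_1 + Omega_2), midpoint quadrature on [0, lam]."""
    h = lam / steps
    s = (np.arange(steps) + 0.5) * h
    A = np.stack([A_of(si) for si in s])
    omega1 = A.sum(axis=0) * h
    omega2 = np.zeros_like(omega1)
    for i in range(steps):            # s_i plays the role of s_1
        for j in range(i):            # s_j < s_i plays the role of s_2
            omega2 += A[i] @ A[j] - A[j] @ A[i]
    omega2 *= 0.5 * h * h
    return expm_taylor(omega1 + omega2)

# Constant unit curvature: d/dlam (J, J') = [[0, 1], [-1, 0]] (J, J').
sphere_generator = lambda s: np.array([[0.0, 1.0], [-1.0, 0.0]])
phi = magnus2_propagator(sphere_generator, lam=1.0)
print(phi)
```

For a generator that varies along the geodesic, the commutator term is what distinguishes this from a naive `exp` of the averaged \(A\).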
Geodesic Trajectory Caching
Each completed inference produces a full geodesic trajectory, not just a terminal answer. A GTC record stores \[\begin{equation*} \mathcal{R} = \bigl(x_0,\,\dot x_0,\,\{x(\lambda_i)\}_{i=1}^M,\,\Phi(\lambda),\,\hat\rho,\,\ell_T\bigr) \end{equation*}\] i.e. the embedding, contextual velocity, waypoint sequence, Magnus-3 Jacobi propagator, an injectivity-radius estimate, and the terminal logits. New queries \(x_0+\delta q\) within the validity radius are served by a single \(\mathcal{O}(k^2)\) matrix--vector product: \[\begin{equation} x^\mu(\lambda) = \bar x^\mu(\lambda) \;+\; \Phi(\lambda)\,\delta q \;+\; \mathcal{O}\bigl(\|\delta q\|^2\bigr). \label{eq:gtc-correction} \end{equation}\]
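A minimal sketch of how a record might serve a nearby query via Eq. [eq:gtc-correction], under stated assumptions: the record is a plain dictionary with hypothetical field names, \(\Phi\) is taken as the \(k\times k\) block acting on positions, and the validity test uses the Euclidean norm as a stand-in for the \(g\)-norm.

```python
import numpy as np

def serve_from_record(record, query):
    """Return the corrected endpoint, or None if the query is outside rho_hat."""
    dq = query - record["x0"]
    if np.linalg.norm(dq) > record["rho_hat"]:
        return None                       # cache miss: fall back to a full pass
    return record["x_bar"] + record["phi"] @ dq

# Toy record on a linear flow x(lambda) = M x0, for which the propagator is exactly M
# and the first-order correction has no remainder.
M = np.array([[0.9, 0.1], [-0.1, 0.9]])
x0 = np.array([1.0, 0.0])
record = {"x0": x0, "x_bar": M @ x0, "phi": M, "rho_hat": 0.4}

near = x0 + np.array([0.05, -0.02])
far = x0 + np.array([1.0, 1.0])
hit = serve_from_record(record, near)
miss = serve_from_record(record, far)
print(hit, miss)
```

On a genuinely curved flow the returned value would differ from the true endpoint by the \(\mathcal{O}(\|\delta q\|^2)\) remainder, which is what the validity radius is meant to bound.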
Resonance.
The Jacobi equation is linear, so a batch of \(B\) correlated queries within the validity radius collapses into a single \(k\times k\times B\) matmul. Throughput therefore rises rather than falls under load, the resonance property. Sec. 9 measures it.
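The collapse into one matmul can be sketched as follows. The propagator and perturbations are random stand-ins rather than real cache contents, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k, B = 8, 1000
phi = rng.standard_normal((k, k))          # stand-in for a cached propagator
x_bar = rng.standard_normal(k)             # cached endpoint
dq = 0.01 * rng.standard_normal((k, B))    # B perturbations, one per column

# Sequential: one matrix-vector product per query.
sequential = np.stack([x_bar + phi @ dq[:, b] for b in range(B)], axis=1)

# Batched: a single (k x k) @ (k x B) product, broadcast-added to x_bar.
batched = x_bar[:, None] + phi @ dq

rel_err = np.linalg.norm(batched - sequential) / np.linalg.norm(sequential)
print(f"relative difference: {rel_err:.2e}")
```

The two paths compute the same linear map, so any difference comes only from floating-point summation order.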
Library convergence.
On a synthetic mixture-of-Gaussians workload the library converges to a 92% hit rate while storing 7.8% of seen queries; on real LM clouds (Sec. 8) the hit-rate at 25%-fraction cache is 90.4--91.5%, scale-invariant.
Connection to Block Attention Residuals
Block AttnRes (Kimi Team 2026) replaces fixed PreNorm residual accumulation with a softmax over learned pseudo-queries against block summaries \(b_n\). Under the OTT lens, the \(b_n\) are waypoints on the geodesic and the AttnRes weights \(\alpha_{n\to\ell}\) select a convex combination of those waypoints, re-anchoring the hidden state to \(\mathcal{M}_\theta\) and mitigating the \(\mathcal{O}(\sqrt L)\) magnitude inflation that PreNorm produces (Xiao et al. 2023). The interpretation is testable: AttnRes weights should concentrate on the block whose cached representation is geodesically nearest to the current state. The companion runtime paper (HyperTensor Project 2025b) reports an independent C implementation of AttnRes (\(\mathtt{-{}-attnres}\)) and its empirical interaction with compression.
Formal addendum (conditional)
This section sharpens the mathematical structure under stronger, explicit assumptions. It is a conditional formalisation: if the assumptions are accepted for a model family, then the corresponding statements follow.
A. Diffeomorphism \(\phi\).
A separation of base assumptions (Euclidean representation base, LayerNorm quotient structure, star-shaped head-chart preimages, residual flow generated by a smooth Morse-like potential satisfying a Palais--Smale-style compactness condition, and a smooth conformal softmax factor inside each chart) yields a candidate global diffeomorphism \(\phi\) from the trained activation cloud to a model manifold. This is a deployment-scoped construction; universal closure across all transformer families is open.
B. Diffusion--attention chain.
Eq. [eq:varadhan] is the endpoint of the implication chain (i) KL/natural-gradient training on a Fisher manifold \(\Rightarrow\) entropy-gradient flow approximation; (ii) embedding limit \(\Rightarrow\) Laplace--Beltrami diffusion form; (iii) heat-kernel identification \(\Rightarrow\) Varadhan asymptotic.
C. Jacobi propagation via JVP.
For smooth geodesic flow, applying \(\Phi(\lambda)\) to a perturbation is exactly a Jacobian--vector product (Pearlmutter 1994), so JVP gives linear-cost directional propagation without materialising a full Jacobian. With effective curvature rank \(r\ll\dim \mathcal{M}_\theta\), practical cost is restricted to an \(r\)-dimensional subspace; Sec. 10 measures \(r=5\) exact.
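The JVP idea can be shown with forward-mode dual numbers in a few lines. This is a sketch with a hypothetical two-dimensional map standing in for the geodesic flow; it is not the runtime's differentiation machinery, and the `Dual` class supports only addition and multiplication.

```python
class Dual:
    """Number a + b*eps with eps**2 = 0; `tan` carries the directional derivative."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

    __rmul__ = __mul__


def jvp(f, x, v):
    """Return (f(x), J_f(x) v) from one forward evaluation of f."""
    out = f([Dual(xi, vi) for xi, vi in zip(x, v)])
    return [o.val for o in out], [o.tan for o in out]


def toy_flow(x):
    # Hypothetical smooth map; its Jacobian is [[x1, x0], [1, 2*x1]].
    return [x[0] * x[1], x[0] + x[1] * x[1]]


value, tangent = jvp(toy_flow, [2.0, 3.0], [1.0, 0.5])
print(value, tangent)
```

The Jacobian is never formed: the cost is one evaluation of the map per direction, which is the property the text relies on.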
D. HJB-regularised AttnRes/GTC objective.
See Sec. 16 for a full statement; cross-referenced here for completeness.
E. Ricci-spectral \(\rho\) estimator.
A practical injectivity-radius estimator family, \[\begin{equation} \hat\rho(q) = C\cdot\frac{\pi}{\sqrt{\lambda_{\max}(\mathrm{Cov}(\nabla_x A))}}\cdot \frac{1}{\sigma_{\max}(A)}, \end{equation}\] combines attention-spectrum curvature proxies with a Klingenberg-style lower-bound form. Treated here as an operational template with model-family calibration constants, not a universal equality.
Open problems
The five blockers from the original draft are no longer all in the same category.
Diffeomorphism \(\phi\). Open as a universal construction; deployment-scoped resolved for the OTT manifold family via certificate-backed inherited-structure arguments.
Initial velocity \(v_0\). Universal closed-form open; the runtime ships a curvature-guided endpoint-direction prior with a Christoffel-based correction. The deployment blocker is reduced to calibration of an implemented surrogate.
Jacobi propagator construction cost. No longer a live blocker: cost is paid at library-construction time and amortised by the compressed record store and exact rank-5 truncation (Sec. 10), and the measured batch resonance (Sec. 9) exceeds the analytic estimates by \(4\!-\!14\times\).
AttnRes + GTC joint training. Open as a training problem. Inference-time correction is measured: single-anchor Jacobi transport is promising, simplex blending underperforms.
Injectivity radius. Cheapest universal estimator open; operational closure exists: per-record \(\hat\rho\) and a measured validity-radius sweep keep Jacobi error \(<0.1\%\) to the tested threshold.
Knowledge-injection curvature warp: a measured negative
A related theoretical proposal, inject a knowledge cluster \(c\) into the manifold by Gaussian-warping the metric \(g'(x){=}g(x){-}\alpha\bigl(1{-}\exp(-|x{-}c|^2/2\sigma^2)\bigr)ww^\top\) so the geodesic toward \(c\) contracts, was reduced to a falsifiable sweep on SmolLM2-135M. Across 32 configurations of \((\alpha,\sigma,\delta\ell)\) with the success criterion "\(\geq 50\%\) reduction in target-endpoint geodesic error AND \(\leq 5\%\) spillover at non-target points", 0/32 pass; the best configuration achieves a 16 % target reduction at \(\alpha{=}0.99\) but its non-target spillover diverges (NaN at high \(\alpha\)).
Cross-model replication (2026-04-29).
The single-model result was replicated across three architectures (SmolLM2-135M, Gemma-4-e2b, Phi-3.5-mini) and two protocol variants: (v1) the original Christoffel-Gaussian warp (\(\alpha{=}0.7\), \(\sigma{=}1.2\), \(T{=}16\), \(\delta\ell{=}0.1\)), and (v2) a covariant Riemannian-exponential push of magnitude \(\alpha{=}0.35\) in a radius-\(1.2\) ball, parallel-transported by the cached Christoffel symbols. Twelve configurations total (3 models \(\times\) 2 intrinsic dimensions \(n\in\{6,8\}\) \(\times\) 2 protocols), 146 s wall on a single CPU core. 0/12 satisfy both criteria. The best v1 case (Gemma-4-e2b \(n{=}6\)) achieves a 39.4 % target-error reduction but 60 % p95 spillover; v2 produces negative target improvements on Phi-3.5-mini (target error grows by 9--24 %), and no configuration meets even the spillover criterion alone: the closest is Gemma-4-e2b \(n{=}8\) v2 (\(-1.8\) % target improvement, 14 % p95 spillover). Phi-3.5-mini v1 \(n{=}6\) produces NaN post-error: the warp drives the geodesic solver out of its convergence basin. The cross-model replication strengthens rather than narrows the negative conclusion: neither protocol generalises, and v2 is worse than the unwarped baseline on every architecture.
The per-configuration breakdown is given in Table 1. The success criterion (target reduction \(\geq 50\) % and spillover p95 \(\leq 5\) %) is failed in every cell. Only one cell (Gemma-4-e2b \(n{=}6\) v1) breaks \(30\) % target reduction, and that cell still shows \(60.3\) % p95 spillover, twelve times the allowed \(5\) %. Eight of twelve cells produce negative target improvement, meaning the warp moved the geodesic farther from the target than the unwarped baseline; six of these eight are v2, the variant that was designed to fix the v1 spillover problem.
| Model | \(n_\text{int}\) | Variant | Target red. (%) | Spillover p95 (%) | Pass? |
|---|---|---|---|---|---|
| Gemma-4-e2b | 6 | v1 christoffel | \(39.4\) | \(60.3\) | no |
| Gemma-4-e2b | 6 | v2 covariant | \(-57.4\) | \(36.9\) | no |
| Gemma-4-e2b | 8 | v1 christoffel | \(5.0\) | \(40.8\) | no |
| Gemma-4-e2b | 8 | v2 covariant | \(-1.8\) | \(14.4\) | no |
| Phi-3.5-mini | 6 | v1 christoffel | NaN | NaN | no (NaN) |
| Phi-3.5-mini | 6 | v2 covariant | \(-9.4\) | \(57.8\) | no |
| Phi-3.5-mini | 8 | v1 christoffel | \(-23.9\) | NaN | no (NaN) |
| Phi-3.5-mini | 8 | v2 covariant | \(-23.7\) | \(157.7\) | no |
| SmolLM2-135M | 6 | v1 christoffel | \(21.6\) | NaN | no (NaN) |
| SmolLM2-135M | 6 | v2 covariant | \(-12.9\) | \(41.4\) | no |
| SmolLM2-135M | 8 | v1 christoffel | \(-20.9\) | \(104.6\) | no |
| SmolLM2-135M | 8 | v2 covariant | \(-22.1\) | \(97.7\) | no |
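The pass/fail column can be recomputed directly from the table. The sketch below copies the twelve rows and applies the stated criterion; treating NaN in either field as a failure is an assumption that matches the "no (NaN)" entries.

```python
import math

NAN = float("nan")
# (model, n_int, variant, target reduction %, spillover p95 %), copied from the table.
cells = [
    ("Gemma-4-e2b", 6, "v1", 39.4, 60.3),
    ("Gemma-4-e2b", 6, "v2", -57.4, 36.9),
    ("Gemma-4-e2b", 8, "v1", 5.0, 40.8),
    ("Gemma-4-e2b", 8, "v2", -1.8, 14.4),
    ("Phi-3.5-mini", 6, "v1", NAN, NAN),
    ("Phi-3.5-mini", 6, "v2", -9.4, 57.8),
    ("Phi-3.5-mini", 8, "v1", -23.9, NAN),
    ("Phi-3.5-mini", 8, "v2", -23.7, 157.7),
    ("SmolLM2-135M", 6, "v1", 21.6, NAN),
    ("SmolLM2-135M", 6, "v2", -12.9, 41.4),
    ("SmolLM2-135M", 8, "v1", -20.9, 104.6),
    ("SmolLM2-135M", 8, "v2", -22.1, 97.7),
]

def passes(target_red, spill_p95, min_red=50.0, max_spill=5.0):
    """Success criterion from the text; NaN in either field counts as a fail."""
    if math.isnan(target_red) or math.isnan(spill_p95):
        return False
    return target_red >= min_red and spill_p95 <= max_spill

n_pass = sum(passes(t, s) for *_, t, s in cells)
print(f"{n_pass}/{len(cells)} configurations pass")
```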
The mechanism is consistent with the empirical finding of Sec. 16 (Empirical feasibility stub): the manifolds exported from frozen pretrained networks are nearly flat in the intrinsic-coordinate basis (kinetic-to-curvature ratio \(> 10^{17}\) in squared L2), so a Gaussian metric perturbation is either too weak to redirect the geodesic or strong enough to redirect it only at the cost of a global metric perturbation that destroys the surrounding manifold. We retain this as a documented negative result; the positive form of the problem (curvature warp via learnable \(\phi\) rather than hand-crafted \(g'\)) is the joint-training premise of Sec. 16, and the cross-model evidence here is what makes that future training premise non-trivial: any future construction must demonstrably outperform the 0/12 hand-crafted variants on the same three-model benchmark before being claimed as a positive result. Raw sweeps: docs/figures/curvature_warp/smollm2-135m_sweep.json (single-model 32-config) and docs/figures/curvature_warp/cross_model_summary.json (cross-model 12-config); status note: docs/figures/curvature_warp/STATUS.md; findings note: docs/figures/findings/FIVE_FINDINGS.md §5.
Part II, Empirical anchor
Setup: fitting the manifold from Phase-1 exports
The C runtime emits one global Christoffel tensor and a per-point metric
diagonal in axgeo_christoffel_t; that representation is too coarse
for the GTC contract. We fit the manifold entirely in Python from Phase-1
clouds:
| Module | Role |
|---|---|
| scripts/gtc/manifold.py | \(k\)-NN Mahalanobis metric, log-Euclidean RBF smoothing, finite-difference \(\Gamma^k_{ij}\) |
| scripts/gtc/geodesic.py | RK4 integrator for \(\ddot x^k = -\Gamma^k_{ij}\dot x^i \dot x^j\) |
| scripts/gtc/jacobi.py | Riemann tensor by FD of \(\Gamma\), Magnus-3 propagator \(\Phi(\lambda)\) |
| scripts/gtc/validity_radius.py | \(\varepsilon\)-sweep, emits <case>_validity_radius.json |
| scripts/gtc/gtc_benchmark.py | Coverage benchmark, emits <case>_coverage.json |
| scripts/gtc/record_store.py | Compressed library + two-stage lookup |
A sphere sanity check at \(K=1\), \(n=4\), 256 samples confirms the harness: the absolute deviation error scales quadratically in \(\varepsilon\), so the relative error grows linearly, as the Jacobi bound predicts (\(\varepsilon^\star(5\%)=0.05\), \(\varepsilon^\star(10\%)=0.10\), \(\varepsilon^\star(20\%)=0.20\)). With the harness validated on a manifold of known curvature, the LM numbers below are unlikely to be harness artefacts.
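For orientation, an RK4 integration of the geodesic equation on the unit 2-sphere can be written in a few lines. This is an illustrative sketch in spherical coordinates \((\theta,\varphi)\) with the closed-form Christoffel symbols \(\Gamma^\theta_{\varphi\varphi}=-\sin\theta\cos\theta\) and \(\Gamma^\varphi_{\theta\varphi}=\cot\theta\); it is not the code in scripts/gtc/geodesic.py, which works from finite-difference \(\Gamma\), and the function names are hypothetical.

```python
import math

def geodesic_rhs(state):
    """state = (theta, phi, dtheta, dphi) on the unit 2-sphere."""
    theta, _, dth, dph = state
    ddth = math.sin(theta) * math.cos(theta) * dph * dph            # -Gamma^theta_{phi phi} dphi^2
    ddph = -2.0 * (math.cos(theta) / math.sin(theta)) * dth * dph   # -2 Gamma^phi_{theta phi} dtheta dphi
    return (dth, dph, ddth, ddph)

def rk4_step(state, h):
    def shifted(s, k, a):
        return tuple(si + a * ki for si, ki in zip(s, k))
    k1 = geodesic_rhs(state)
    k2 = geodesic_rhs(shifted(state, k1, h / 2))
    k3 = geodesic_rhs(shifted(state, k2, h / 2))
    k4 = geodesic_rhs(shifted(state, k3, h))
    return tuple(s + h / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def integrate(state, lam, steps=200):
    """Advance the geodesic by affine parameter lam in equal RK4 steps."""
    for _ in range(steps):
        state = rk4_step(state, lam / steps)
    return state

# Start on the equator with a unit-speed velocity tilted out of the equatorial plane.
start = (math.pi / 2, 0.0, 0.6, 0.8)
end = integrate(start, lam=1.0)
speed_sq = end[2] ** 2 + math.sin(end[0]) ** 2 * end[3] ** 2
print(end, speed_sq)
```

The true geodesic is the great circle \(p(\lambda)=\cos\lambda\,p_0+\sin\lambda\,v_0\), so the integrated \(\cos\theta\) can be compared against \(-0.6\sin\lambda\), and the metric speed should stay at 1.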
Online vs. offline manifold density (Llama-3.1-8B).
A late-stage end-to-end run of the same Phase-1 pipeline on
Llama-3.1-8B-Instruct (axiom_beta_report.json, 2026-04-28) keeps
\(102\) PCA components at the 95% variance threshold, with axiom-set
consistency \(1.000\) and Fisher-blend \(0.20\) on \(128\) metric sample points
and \(400\) geodesic steps. Sampling the runtime trajectory live
(cloud_density_summary.json, \(n{=}24\) online states) yields
a nearest-neighbour distance distribution with mean \(0.132\), median
\(0.093\), and 90th percentile \(0.348\) in the cosine metric. Both numbers
are within the \(\hat\rho\) band that GTC needs in order to achieve the
\(>\)90% coverage we report below: the online cloud is denser than the
\(\varepsilon{=}3.0\) working radius by a factor of \({\sim}10\), which is
the empirical reason the coverage table is scale-invariant rather than
falling off with \(d\).
Manifold parameters at frontier scale.
A cold Phase-1 run on Llama-3.1-8B-Instruct Q4_K_M (256 hidden-state samples, 1024-token embedding probe) measures TwoNN intrinsic dimension \(k_\text{int}{=}19.96\) and 132 PCA components for 95.1% variance retention (Phase 1 wallclock 63.6 s; Phase 3 Fisher-blended curvature on 17,408 block-sampled tokens, 31.0 s; Phase 4 axiom set 40 axioms, consistency \(0.9499\), 26 oracle calls). This is the largest published TwoNN intrinsic-dim measurement on a frontier-scale instruct backbone that we are aware of, and it sits just below the \(k\!\approx\!30\)--\(50\) band reported by prior work (Ansuini et al. 2019; Pope et al. 2021), consistent with the smoothness hypothesis: as models scale, \(k_\text{int}/d\) collapses (\(20/4096{\approx}0.005\) for Llama-8B vs. \({\approx}0.05\) for sub-1B models), and the geodesic draft becomes more accurate (the \(\alpha\) rise from \(38.5\%\) on SmolLM2-135M to \(46.9\%\) on Llama-8B documented in Paper C (HyperTensor Authors 2026) is the operational signature of this collapse).
Mistral-Nemo-12B replication.
A second frontier-scale Phase-1 run on Mistral-Nemo-Instruct-2407 Q4_K_M (40 layers, \(d{=}5120\), 13.67 B parameters, 256 hidden-state samples, 1024-token probe) yields TwoNN \(k_\text{int}{=}5.36\) and 156 PCA components for 95.0% variance (Phase 1 wallclock 6.1 s; Phase 3 Fisher-blended curvature on 20,480 block-sampled tokens, 63.6 s; Phase 4 10 axioms, consistency \(0.9407\)). The PCA-rank reading is consistent with Llama-8B (132 components, \(d{=}4096\)); the very low TwoNN reading at \(n{=}256\) samples is plausibly an undersampling artefact for a wider model (\(d{=}5120\)), and we report it here as preliminary. The matched sample budget for an apples-to-apples comparison with the Llama-8B 512-sample run is the next manifold-survey item. Two independent frontier-scale instruct backbones (8 B and 12 B class) now agree that the variance-95% projection rank sits around \(130\text{--}160\), so that its square is about three orders of magnitude below \(d^2\), which is the operationally relevant number for OTT runtime sizing.
Coverage scaling across three models
Coverage is the fraction of held-out activation cloud points within \(g\)-norm distance \(\varepsilon\) of the nearest cached point. All three measurements at \(\varepsilon=3.0\), \(n_{\mathrm{intrinsic}}=8\), \(n_{\mathrm{repeats}}=16\).
| Model | Params | \(k=6\) (10%) | \(k=16\) (25%) | \(k=32\) (50%) | \(k=48\) (75%) |
|---|---|---|---|---|---|
| SmolLM2-135M | 135 M | 58.6% | 91.0% | 99.8% | 100.0% |
| Phi-3.5-mini | 3.8 B | 55.5% | 90.4% | 98.2% | 100.0% |
| Gemma-4-E2B | 4.5 B | 58.7% | 91.5% | 99.6% | 100.0% |
Finding. The scale-invariance prediction (the "flag flip" claim) holds within \(\pm 0.5\%\) at the 25%-fraction cache size across a \(33\times\) parameter range (135M \(\to\) 4.5B). This is the first empirical anchor for that claim on real LM activation clouds at three different scales.
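The coverage definition used in this section can be sketched as below. Two assumptions are made for illustration: Euclidean distance stands in for the \(g\)-norm, and a synthetic Gaussian cloud stands in for a Phase-1 export, so the printed numbers are not comparable to the table.

```python
import math
import random

def coverage(cache, held_out, eps):
    """Fraction of held-out points within eps of their nearest cached point."""
    hits = sum(
        1 for p in held_out
        if min(math.dist(p, c) for c in cache) <= eps
    )
    return hits / len(held_out)

random.seed(0)
n_intrinsic = 8
cloud = [[random.gauss(0.0, 1.0) for _ in range(n_intrinsic)] for _ in range(64)]
held_out = [[random.gauss(0.0, 1.0) for _ in range(n_intrinsic)] for _ in range(64)]

# Cache sizes match the table columns: 6, 16, 32 and 48 of a 64-point cloud.
curve = {k: coverage(cloud[:k], held_out, eps=3.0) for k in (6, 16, 32, 48)}
for k, c in curve.items():
    print(f"k={k:2d}  coverage={c:.3f}")
```

Because each cache is a prefix of the next, coverage is non-decreasing in \(k\) by construction.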
Batch Jacobi resonance
The Jacobi propagator \(\Phi(\lambda)\) is linear in the perturbation, so a batch of \(B\) correlated queries can be corrected in a single matmul.
| Batch \(B\) | Sequential (ms) | Batched (ms) | Speedup | μs/query | rel. error |
|---|---|---|---|---|---|
| 1 | 0.015 | 0.001 | 14.6\(\times\) | 1.000 | 0 |
| 10 | 0.411 | 0.004 | 97.9\(\times\) | 0.420 | \(1.1\!\times\!10^{-16}\) |
| 100 | 0.167 | 0.006 | 27.4\(\times\) | 0.061 | \(1.2\!\times\!10^{-16}\) |
| 1 000 | 1.143 | 0.026 | 44.5\(\times\) | 0.026 | \(1.2\!\times\!10^{-16}\) |
| 10 000 | 11.100 | 0.185 | 60.0\(\times\) | 0.0185 | \(1.2\!\times\!10^{-16}\) |
The analytic estimates for the \(B\in\{10,100,1000\}\) regimes were \(2.7\times/12.5\times/7.0\times\); the numpy-BLAS realisation exceeds them by \(4\)--\(14\times\) because the analytic estimate did not account for cache and SIMD effects. The reconstruction error stays at the float64 roundoff floor across all batch sizes, so the speedup is not paid for in numerical fidelity.
Compressed record store
A naive trajectory record is hundreds of KB. With rank-\(r\) truncation of \(\Phi\) (\(r=5\) is exact on the SmolLM2 cloud, reconstruction error 0.0) and waypoint subsampling, persisted records reach 5.96 KB, roughly an order of magnitude under the 50--80 KB design target.
| Quantity | Value | Design target |
|---|---|---|
| Records persisted | 24 | n/a |
| Total .npz size | 143.0 KB | n/a |
| Per-record size | 5.96 KB | 50--80 KB |
| Rank-5 \(\Phi\) reconstruction error | 0.0 | "rank \(\approx 5\) is sufficient" |
| Build wall-clock (24 records, \(k=8\)) | 6.087 s | n/a |
| Two-stage lookup (1 000 queries) | 31 ms total | \(<5\) ms/query |
| Per-query lookup latency | 30.9 μs | \(<5\) ms (\(\sim\!160\times\) under) |
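One way a rank-\(r\) truncation of \(\Phi\) could be realised is a truncated SVD stored as two thin factors. The sketch below assumes that realisation (the text does not state which factorisation record_store.py uses) and a \(16\times16\) propagator, i.e. \(2k\times2k\) with \(k=8\), built to have exact rank 5.

```python
import numpy as np

def truncate_rank(phi, r):
    """Best rank-r approximation of phi, returned as two thin factors."""
    u, s, vt = np.linalg.svd(phi, full_matrices=False)
    return u[:, :r] * s[:r], vt[:r, :]

rng = np.random.default_rng(0)
n, r = 16, 5
# Synthetic propagator of exact rank 5, standing in for a cached Phi.
phi = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

left, right = truncate_rank(phi, r)
rel_err = np.linalg.norm(left @ right - phi) / np.linalg.norm(phi)
stored = left.size + right.size          # 2*n*r floats instead of n*n
print(f"relative error {rel_err:.1e}, stored floats {stored} vs {phi.size}")
```

When the true rank does not exceed \(r\), the truncation is exact up to roundoff, which is the situation the table reports for the SmolLM2 cloud.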
Density caveat.
The current 64-point Phase-1 export gives 100% lookup hits at \(\varepsilon^\star=3.0\) but 0% within the Jacobi validity radius \(\hat\rho=0.4\) on a held-out cloud. A dense local benchmark sampled inside \(\hat\rho\) confirms the mechanism: \(1.43\!\times\!10^{-7}\) mean relative error and \(158\times\) speedup over a full geodesic step. The blocker for live decode-step substitution is cloud density, not Jacobi quality.
OTT runtime anchor
The C host runtime ships an end-to-end OTT pipeline: geometry-cache load,
OneDecode bake, speculative decode against a verifier, and a readiness gate
that emits ott_readiness_report.json. The pipeline reaches
status=geodesic_ready:
| Quantity | Value | Notes |
|---|---|---|
| OTT readiness status | geodesic_ready | ready/hybrid_ready/share/consistency all=1.0 |
| Acceptance rate \(\alpha\) | 38.5% | 5 geo-accepted / 13 generated |
| End-to-end throughput | 76.5 tok/s | 13 tokens in 170 ms; greedy baseline \(\sim\!50\) tok/s |
| Empirical speedup | \(1.53\times\) | Within Paper C closed-form prediction at \(\alpha=0.385,\gamma=4\) |
| OD draft hits | 5 | OneDecode table hits |
| Final adaptive batch | 4 | Stable; did not collapse |
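As a plausibility check on the measured speedup, the standard expression from Leviathan et al. (2023) can be evaluated at the table's \(\alpha\) and \(\gamma\). This is an assumption about the form of the bound: Paper C's closed form may differ, and `draft_cost` (draft-step time divided by verifier-pass time) is not reported here, so the sketch sets it to zero to obtain an upper limit.

```python
def expected_tokens_per_verify(alpha, gamma):
    """Expected tokens emitted per verifier pass with i.i.d. acceptance rate alpha."""
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def walltime_speedup(alpha, gamma, draft_cost):
    """Speedup over plain decoding; draft_cost = draft step time / verifier pass time."""
    return expected_tokens_per_verify(alpha, gamma) / (gamma * draft_cost + 1.0)

upper = walltime_speedup(0.385, 4, draft_cost=0.0)   # free drafter: best case
measured = 76.5 / 50.0                               # tok/s ratio from the table
print(f"upper bound {upper:.3f}x, measured {measured:.2f}x")
```

Under these assumptions the measured \(1.53\times\) sits below the zero-cost limit of about \(1.61\times\), as it should for any drafter with non-zero cost.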
The instruct-greedy-EOS pathology
Earlier integrations of the speculative loop returned zero tokens against this instruct model. The cause: the verifier's argmax at position 0 (and several subsequent positions) is the EOS token. A standard speculative loop (Leviathan et al. 2023; Chen et al. 2023) sees an EOS draft and exits. Existing literature (including Medusa (Cai et al. 2024) and EAGLE (Li et al. 2024)) does not document this case because it primarily targets base (non-instruct) backbones where the greedy distribution does not degenerate into EOS.
The fix shipped here is a small primitive we call logit-excluding top-1:
// runtime/nn/llm.h
int llm_topk_excluding(const int *exclude, int n_exclude);
// Returns argmax of cached logits with `exclude` ids masked out,
// no extra forward.
plus a min-response guard \(N_{\min}=4\) that enables the bypass only at positions \(i<N_{\min}\). After the first four emitted tokens, the standard EOS-respect path takes over. The \(\S\)11 numbers are conditional on the fix being in place; removing it returns the loop to 0 tok/s.
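The behaviour of the primitive and its guard can be sketched in Python. This is an illustration of the rule as described, not a port of the C function: `top1_excluding` and `pick_token` are hypothetical names, and the logits are a toy list rather than the runtime's cached buffer.

```python
def top1_excluding(logits, exclude):
    """Argmax over cached logits with the ids in `exclude` masked out."""
    banned = set(exclude)
    candidates = [i for i in range(len(logits)) if i not in banned]
    return max(candidates, key=lambda i: logits[i])

def pick_token(logits, position, eos_id, n_min=4):
    """Bypass a greedy EOS only while fewer than n_min tokens have been emitted."""
    greedy = max(range(len(logits)), key=lambda i: logits[i])
    if greedy == eos_id and position < n_min:
        return top1_excluding(logits, [eos_id])
    return greedy

logits = [0.1, 2.5, 0.3, 4.0]      # toy vocabulary of 4 ids; id 3 plays EOS
early = pick_token(logits, position=0, eos_id=3)
late = pick_token(logits, position=4, eos_id=3)
print(early, late)
```

At position 0 the EOS argmax is skipped in favour of the runner-up; from position 4 onward EOS is respected, which mirrors the \(N_{\min}=4\) guard.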
Geometry-cache consistency-equivalence
The OTT readiness gate previously failed when geometry was loaded from the
persistent cache, because the calibration phase that writes
consistency_score is skipped on cache hit and the score defaults to
0. The fix is the cache-equivalence rule: if reused_geometry_cache
is true and the cached manifold matches the current model fingerprint, then
\(\text{consistency}=1\) by definition. Practically a one-line guard;
theoretically the statement that calibration is invariant under fixed-manifold
reuse.
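The guard can be sketched as a single function. The argument names other than reused_geometry_cache are hypothetical, and the fall-through to 0 mirrors the default described above rather than any confirmed runtime behaviour.

```python
def consistency_score(reused_geometry_cache, cached_fingerprint,
                      model_fingerprint, calibrated_score=None):
    """Cache-equivalence rule: a fingerprint-matched cache hit counts as consistent."""
    if reused_geometry_cache and cached_fingerprint == model_fingerprint:
        return 1.0
    # Otherwise use whatever calibration produced; 0.0 if it never ran.
    return calibrated_score if calibrated_score is not None else 0.0

print(consistency_score(True, "abc", "abc"))          # matched cache hit
print(consistency_score(True, "abc", "xyz"))          # stale cache, no calibration
print(consistency_score(False, None, "abc", 0.95))    # fresh calibration
```

The fingerprint comparison is what keeps the rule from certifying a cache that was built for a different model.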
Gap analysis vs. the OTT claim list
| OTT claim | Status | Anchor |
|---|---|---|
| Christoffel field \(\Gamma\) from \(g\) | measured | manifold.py |
| Geodesic ODE integrator | measured | geodesic.py |
| Riemann tensor + Jacobi propagator | measured | jacobi.py |
| Sphere sanity, quadratic \(\varepsilon\) scaling | exact | §7 |
| Hit rate \(\ge 65\%\) on clustered distribution | 90.4--91.5% | §8 |
| Library size sublinear | \(k{=}16\) covers 91% of 64-pt cloud | §8 |
| Batch matmul \(\equiv\) sequential | \(1.2\!\times\!10^{-16}\) rec. err. | §9 |
| Batch \(B{=}10/100/1000\) speedups | \(97\times,27\times,44\times\) | §9 |
| Two-stage FAISS+geodesic lookup | 30.9 μs/q | §10 |
| Compressed record store (\(\sim\!50\)--\(80\) KB target) | 5.96 KB at \(k{=}8\) | §10 |
| Scaling: SmolLM2 \(\to\) Phi-3.5-mini "flag flip" | \(\pm 0.5\%\) | §8 |
| Validity / injectivity radius \(\hat\rho\) scaling | \(<0.1\%\) err to \(\varepsilon{=}5.0\) | validity_radius.json |
| OTT locality of curvature warp | ratio \(7\!\times\!10^{11}\), decays to 0 at \(20\sigma\) | implicit |
| OTT runtime end-to-end | partial: \(\alpha=0.385\), \(1.53\times\) | §11 |
| Knowledge-injection curvature warp | negative: 0/32 single-model + 0/12 cross-model | curvature_warp/ |
| AttnRes block-summary integration | prototype: block-end Jacobi err 1.29% | attnres_integration.json |
| Diffeomorphism \(\phi\) construction | resolved for OTT family via certificates | deployment |
| Initial velocity \(v_0\) | universal open; deployable surrogate exists | axiom_beta.c |
Reading: 12 measured pass, 1 measured fail (curvature-warp knowledge-injection), 2 measured partial (live-decode replacement, AttnRes integration), 1 universally open / deployment-resolved (\(\phi\)), 1 universally open / deployable surrogate (\(v_0\)). The OTT programme is no longer "framework only": it is a framework with a verified core and a short, named list of open items.
What is genuinely new here
Logit-excluding top-1 with min-response guard for instruct-tuned drafters in speculative decoding. Closes the instruct-greedy-EOS failure mode without forward-pass overhead. We are not aware of a published treatment in the existing speculative-decoding literature (Leviathan et al. 2023; Chen et al. 2023; Cai et al. 2024; Li et al. 2024).
Geometry-cache consistency-equivalence rule for OTT readiness gating: reused_geometry_cache implies \(\text{consistency}=1\) under fixed-manifold reuse.
Empirical scale-invariance of cache coverage across a \(33\times\) parameter range at fixed sample budget, the first measurement of this prediction on real LM clouds at three scales.
The other components, geodesic ODE, Jacobi propagator, GP compression, OneDecode, the speculative-decoding rejection rule, are inherited from prior work and are explicitly cited as such.
Reproduction
# Phase-1 cloud export (emits docs/figures/gtc/<case>.json)
python scripts/gtc/manifold.py --case smollm2-135m
python scripts/gtc/jacobi.py --case smollm2-135m
python scripts/gtc/gtc_benchmark.py --case smollm2-135m
python scripts/gtc/record_store.py --case smollm2-135m
# OTT runtime anchor (emits ott_readiness_report.json)
./geod --ott --gguf SmolLM2-135M-Instruct-Q8_0.gguf \
--prompt "Hello" --gamma 4 --max-tokens 13
Limitations
Coverage measured at \(n_{\mathrm{intrinsic}}=8\) on 64-point clouds; live decode-step substitution is density-gated.
Curvature-warp knowledge-injection result is negative (best gain 2.24%, 0/32 pass); reported faithfully.
AttnRes integration is a prototype only.
No Llama-3.1-8B sweep; gated on EC2 compute, approved but not executed.
The \(\mathtt{-{}-ott-perfect}\) rollout (transformer-exact drafter) currently hangs in llm_rollout_exact_greedy; the \(\mathtt{-{}-ott-swarm-k}\) wrapper exits non-zero.
Universal closure of \(\phi\) and \(v_0\) remains open.
Future Work: HJB-Regularised Joint Training
The OTT runtime, as shipped, is an inference-time geometric overlay: the base weights are frozen and the manifold structure is read off the pre-trained network. A natural extension is to make the network and the manifold co-trained, by adding a Hamilton--Jacobi--Bellman (HJB) style regulariser that forces residual-stream trajectories to be Jacobi-consistent under the runtime's own curvature estimator.
Let \(s_\ell\in\mathbb{R}^{d}\) denote the block summary at layer \(\ell\in\{0,\dots,L\}\) (the residual-stream state at the block boundary), let \(\Delta s_\ell := s_{\ell+1}-s_\ell\) be the discrete velocity, and let \(\Delta^2 s_\ell := s_{\ell+1}-2s_\ell+s_{\ell-1}\) be the discrete acceleration. Let \(\hat R(s)\in\mathbb{R}^{d\times d}\) be the runtime's local sectional-curvature operator (the same \(\hat R\) used by the Jacobi propagator \(\Phi(\lambda)\) of Sec. 2.2). Then the finite-difference Jacobi residual \[\begin{equation} \mathcal{J}_\ell(s) \;:=\; \Delta^2 s_\ell \;+\; \hat R(s_\ell)\,\Delta s_\ell \label{eq:jacobi-residual} \end{equation}\] is the discrete Bellman residual of an HJB-style optimality condition on the layer-indexed trajectory \(\ell\mapsto s_\ell\). We propose the joint objective \[\begin{equation} \boxed{\; L_{\mathrm{SHF}}(\theta) \;=\; L_{\mathrm{task}}(\theta) \;+\; \lambda\,\sum_{\ell=1}^{L-1} \bigl\|\,\mathcal{J}_\ell(s_\ell;\theta)\,\bigr\|_2^{\,2} \;} \label{eq:hjb-loss} \end{equation}\] with \(\lambda\in[10^{-3},10^{-1}]\), computed on the same block summaries used by the GTC runtime so that the regulariser is architecturally free at the storage layer. We refer to this as the SHF (Spectral Hamiltonian Flow) loss.
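The regulariser term can be evaluated on a stored trajectory with a few array operations. The sketch below is illustrative and not part of the reference runtime: `curvature` is a hypothetical callable returning \(\hat R(s_\ell)\) as a \(d\times d\) array, and the zero-curvature stand-in is chosen only so that the result is easy to check by hand.

```python
import numpy as np

def jacobi_residuals(s, curvature):
    """J_l = Delta^2 s_l + R_hat(s_l) Delta s_l for l = 1 .. L-1; s has shape (L+1, d)."""
    d1 = s[2:] - s[1:-1]                       # Delta s_l = s_{l+1} - s_l
    d2 = s[2:] - 2.0 * s[1:-1] + s[:-2]        # Delta^2 s_l
    r = np.stack([curvature(x) @ v for x, v in zip(s[1:-1], d1)])
    return d2 + r

def shf_penalty(s, curvature, lam=1e-2):
    """lam * sum_l ||J_l||_2^2, the regulariser term of the joint objective."""
    return lam * float(np.sum(jacobi_residuals(s, curvature) ** 2))

def flat(x):
    # Hypothetical zero-curvature operator: the residual reduces to Delta^2 s_l.
    return np.zeros((x.size, x.size))

line = np.linspace(0.0, 1.0, 9)[:, None] * np.ones((1, 4))   # straight trajectory, L = 8
bent = line.copy()
bent[4] += 1.0                                               # kink at the middle layer

penalty_line = shf_penalty(line, flat)
penalty_bent = shf_penalty(bent, flat)
print(penalty_line, penalty_bent)
```

With zero curvature a straight trajectory incurs no penalty and a kinked one does, which is the flat-space limit of the geodesic condition the loss is meant to enforce.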
Why this is the right form.
Equation [eq:hjb-loss] is a finite-difference enforcement of the geodesic equation \(\nabla_{\dot\gamma}\dot\gamma=0\) in the linearised regime where \(\hat R\) controls the second variation; equivalently, it is the discrete-time HJB residual of a value function whose generator is the Fisher--Rao Laplacian of Section 2.2. Models trained with \(L_{\mathrm{SHF}}\) should, by construction, admit sharper Magnus-3 propagators, a larger admissible step \(\lambda\) (the propagator parameter of \(\Phi(\lambda)\), not the regularisation weight in [eq:hjb-loss]), and lower OTT verifier disagreement under speculative decoding.
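To make the finite-difference reading explicit, here is a short sketch under two assumptions that are ours rather than stated results: the layer index \(\ell\) is treated as a continuous parameter with spacing \(h\) (a symbol introduced here for illustration only), and the linearised flow is written as \(\ddot s + \hat R(s)\,\dot s = 0\). Replacing \(\ddot s\) by a central second difference and \(\dot s\) by a forward first difference gives \[ \frac{s_{\ell+1}-2s_\ell+s_{\ell-1}}{h^{2}} \;+\; \hat R(s_\ell)\,\frac{s_{\ell+1}-s_\ell}{h} \;=\; 0, \] which at unit layer spacing \(h=1\) is exactly \(\mathcal{J}_\ell(s)=0\) in [eq:jacobi-residual]. The penalty in [eq:hjb-loss] therefore measures, layer by layer, how far the trajectory is from satisfying this discretised condition.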
Status.
This loss is not implemented in the reference runtime that produces the measurements reported in this paper; the runtime is inference-only and the base network is frozen. We state Equation [eq:hjb-loss] here so that any future training run that uses this exact penalty is on the public record as derived from the OTT/GTC framework.
Empirical feasibility stub (2026-04-29).
While \(L_\text{SHF}\) is not yet a training objective, we can already measure \(\bigl\|\mathcal{J}_\ell(s_\ell)\bigr\|_2^2\) on trajectories through each model's fitted Phase-3 manifold to fix the order of magnitude that any future trainer must reach. We compute Equation [eq:jacobi-residual] along three trajectory classes (intrinsic dim \(n{=}8\), \(L{=}32\) steps, 12 trajectories per class): (i) baked geodesics integrated by Magnus-3 forward Euler (\(x_{t+1}{=}x_t{+}\delta\,v_t,\; v_{t+1}{=}v_t{-}\delta\,\Gamma^k_{ij}v^iv^j\)), the SHF floor; (ii) \(K{=}4\)-nearest-neighbour walks through the Phase-3 cloud, the manifold-conformant regime; (iii) random straight-line interpolations between two cloud endpoints, the off-manifold ceiling.
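The update rule quoted for class (i) can be sketched as a plain forward-Euler loop. This is an illustrative sketch only: the function name `euler_geodesic` and the callable `christoffel` (returning an array with `gamma[k, i, j]` standing for \(\Gamma^k_{ij}\) at the current point) are hypothetical, and the sketch does not reproduce any Magnus-3 machinery or the contents of scripts/hjb_feasibility.py.

```python
import numpy as np


def euler_geodesic(x0, v0, christoffel, delta, steps):
    """Integrate x_{t+1} = x_t + delta * v_t,
                 v_{t+1} = v_t - delta * Gamma^k_ij(x_t) v^i v^j.

    christoffel : hypothetical callable, point (n,) -> array (n, n, n)
                  indexed as gamma[k, i, j].
    Returns the positions as an array of shape (steps + 1, n).
    """
    x = np.asarray(x0, dtype=float).copy()
    v = np.asarray(v0, dtype=float).copy()
    path = [x.copy()]
    for _ in range(steps):
        gamma = christoffel(x)                               # evaluated at x_t
        accel = -np.einsum("kij,i,j->k", gamma, v, v)        # -Gamma^k_ij v^i v^j
        x = x + delta * v                                    # position uses v_t
        v = v + delta * accel                                # then update velocity
        path.append(x.copy())
    return np.stack(path)


if __name__ == "__main__":
    n = 8                                                    # intrinsic dim used in the text
    flat = lambda x: np.zeros((n, n, n))                     # toy flat connection (assumption)
    path = euler_geodesic(np.zeros(n), np.ones(n), flat, delta=0.1, steps=32)
    print(path.shape)
```

With a flat (all-zero) connection the velocity never changes, so the integrated path is a straight line; any curvature in the supplied symbols bends it step by step.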
Kinetic (\(\Delta^2 s_\ell\)) and curvature
(\(\hat R(s_\ell)\Delta s_\ell\)) summands are reported separately. On
all three exported manifolds (smollm2-135m,
gemma-4-e2b, phi-3.5-mini) the kinetic term dominates
the curvature term by 17--32 orders of magnitude in squared \(L_2\) norm.
The off-manifold straight-line baseline produces residuals of order
\(10^{-30}\) in the units of the unit-scaled manifold. It is
over-determined by definition: a straight line in cloud coordinates
is its own discrete geodesic against an isotropic metric, so in this
measurement it does not behave as a ceiling. The
NN-walk class produces \(7\)--\(93\) in the same units, and the
baked-geodesic floor sits between \(10^{-34}\) and \(10^{-3}\). The per-residual
cost at \(n{=}8\) is \(\approx\!5.6\!\times\!10^{5}\) FLOPs, dominated by
the \(O(n^4)\) Riemann-tensor evaluation, which the runtime already
caches per sample point in service of the Jacobi propagator.
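As a worked check on the straight-line class, assume the interpolation is uniformly spaced between two endpoints \(a\) and \(b\) (the symbols \(a\), \(b\) and the uniform spacing are assumptions made here for illustration; the script may sample differently). With \(s_\ell = a + \tfrac{\ell}{L}(b-a)\), \[ \Delta s_\ell = \tfrac{1}{L}(b-a), \qquad \Delta^2 s_\ell = 0, \qquad \mathcal{J}_\ell(s) = \tfrac{1}{L}\,\hat R(s_\ell)\,(b-a). \] Under that assumption the kinetic summand vanishes identically, up to floating-point rounding, and the residual reduces to the curvature summand alone, which is consistent with the very small magnitudes reported for this class.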
Two conclusions follow for the SHF training prescription:
(a) the loss term is empirically tractable (sub-second compute for one
\(L{=}32\) trajectory at \(n{=}8\)), and
(b) on manifolds exported from frozen pretrained networks the
curvature contribution is negligible, so \(L_\text{SHF}\) in its current
form would act primarily as a second-difference smoothness penalty
on block summaries unless the trainer co-evolves a manifold of
demonstrably non-zero curvature (which is the joint-training premise
of this section and a falsifiable target).
Raw data: docs/figures/paper-d/hjb_feasibility/hjb_residual_magnitudes.json; script: scripts/hjb_feasibility.py.