Paper D - OTT/GTC

Organic Training Theory and the GTC Manifold Runtime

HyperTensor Project (William Ken Ohara Stewart)

HyperTensor Project · April 2026 · PDF · TeX source

Project nomenclature. HyperTensor is the project umbrella; Geodesic Projection (GP) is the inference-time pipeline (companion paper); Geodesic Runtime Compression (GRC) is its attention-only subset; and OTT/GTC (Organic Tensor Topology / Geodessical Tensor Cache), the subject of this paper, is the longer-term theoretical framework that views the GP projection bases as charts on a Riemannian manifold. GP is the engineering layer that this paper provides the geometric justification for.

Introduction

Modern transformer inference is dominated, on bandwidth-bound hardware, by the shape of the residual stream rather than by raw arithmetic (Vaswani et al. 2017; Grattafiori et al. 2024; Williams et al. 2009). Empirical work on linear representations and intrinsic dimensionality consistently reports that the activation cloud of a trained model occupies a low-dimensional submanifold of the embedding space; estimates put intrinsic dimension at \(k\!\approx\!30\)--\(50\) even when ambient \(d\in\{4096,8192\}\) (Elhage et al. 2022; Nakkiran et al. 2021).

Cross-architecture \(k_{\mathrm{int}}\) invariance.

Running our manifold-extraction pipeline on three open transformers of widely different widths makes the same point quantitatively:

model ambient \(d\) \(k_{\mathrm{int}}\) at \(95\%\) var. \(k_{\mathrm{int}}/d\)
SmolLM2-135M 576 17 \(0.030\)
Gemma-4-E2B 1536 25 \(0.016\)
Phi-3.5-mini 3072 11 \(0.0036\)

A \(5.3\times\) increase in \(d\) buys an \(\sim\)8 \(\times\) drop in \(k_{\mathrm{int}}/d\). The intrinsic dimension does not scale with the ambient width, which is the empirical premise on which both OTT (this paper) and GP/GRC (companion papers) operate.

This paper takes that observation seriously. Under Organic Training Theory (OTT), we model the trained latent space as a Riemannian manifold \(\mathcal{M}_\theta\) with the Fisher-information metric and read inference as approximate geodesic flow on \(\mathcal{M}_\theta\). Geodesic Trajectory Caching (GTC) operationalises that view: completed geodesics, with their Magnus-3 Jacobi propagator, are stored in a library and reused by a single linear correction.

The companion runtime paper (HyperTensor Project 2025b) measures one OTT configuration on a C runtime; the GP compression paper (HyperTensor Project 2025a) measures the static side of the same manifold. The present paper merges what was previously two web articles (Paper 4 theory; Paper 5 measurement-anchor): Part I formalises OTT/GTC, Part II reports the measurement anchor across three open-weight models.

Naming.

An earlier draft used "GRC" for Geodesic Resonance Caching. Paper A in this series (Geodesic Runtime Compression) uses "GRC" for the calibration-free attention-compression mechanism, an unrelated implemented thing. To remove the collision, the trajectory-library idea is named GTC throughout this paper.

Contributions.
  1. A clean statement of OTT and GTC with assumption-explicit theorem templates (Sec. 5), separating universal claims from deployment-scoped ones.

  2. A measured cache-coverage curve on real LM activation clouds at three model scales, demonstrating empirical scale-invariance within \(\pm 0.5\%\) at the 25%-fraction cache size (Sec. 8).

  3. A batch-Jacobi resonance benchmark showing \(97\times\) speedup at \(B=10\) with reconstruction error at the float64 roundoff floor (Sec. 9).

  4. A compressed record store at 5.96 KB/record with exact rank-5 propagator truncation and a 30.9 μs/query two-stage lookup (Sec. 10).

  5. Two small primitives needed to make the C-runtime path actually run on instruct-tuned drafters: logit-excluding top-1 with min-response guard and the geometry-cache consistency-equivalence rule (Sec. 11).

End-to-end OTT/GTC pipeline. The offline path (Sec. 2--10) fits a Riemannian manifold from a Phase-1 activation export, computes batched Jacobi-field propagators in the Magnus-3 truncation, and writes a compressed record store. The runtime path (Sec. 11) consumes the manifold and the record store via the geometry-cache consistency-equivalence rule, with the \(\ell^\star\!\approx\!2L/3\) depth-sink shortcut making the cache check \(O(1)\) per token. Numbers cited in the boxes are measured on the empirical anchor in [sec:resonance,sec:store,sec:runtime].

Part I, Theory

Organic Training Theory

Let \(\theta\in\mathbb{R}^P\) denote the trained weights of a transformer and let \(f_\theta\) be its forward map up to a chosen layer. Define the activation manifold \[\begin{equation} \mathcal{M}_\theta = \{\,x\in\mathbb{R}^d : x = f_\theta(\text{tokens}) \text{ for some input}\,\}, \end{equation}\] equipped with a Fisher-information metric \(g\) approximated locally from the sample covariance of \(\nabla_\theta\log p_\theta\) pulled back to activation space. Under regularity assumptions made explicit in Sec. 5, \(\mathcal{M}_\theta\) admits a Levi-Civita connection \(\Gamma^k_{ij}(x)\) and inference along a geodesic \[\begin{equation} \ddot x^k(\lambda) + \Gamma^k_{ij}(x)\,\dot x^i(\lambda)\,\dot x^j(\lambda) = 0 \label{eq:geodesic} \end{equation}\] has cost \(\mathcal{O}(nk^2)\) in intrinsic dimension \(k\), against \(\mathcal{O}(n^2 dL)\) for the standard transformer pass. The empirical gap between \(k\) and \(d\) (estimates of 30--50 vs. 4096--8192) is the source of the asymptotic improvement.

Fisher metric and the heat-kernel bridge

A natural-gradient interpretation of training (Golub and Van Loan 1996; Nakkiran et al. 2021) suggests that the metric \(g\) on \(\mathcal{M}_\theta\) is the Fisher information; under additional smoothness assumptions one obtains a Laplace--Beltrami diffusion form whose heat kernel \(G_t(x,y)\) satisfies Varadhan's asymptotic identity \[\begin{equation} d_g(x,y)^2 \;=\; \lim_{t\to 0}\,-4t\log G_t(x,y),\qquad \dot\gamma(0)\;\propto\;\nabla_x\bigl(-\log\mathrm{Attn}(x,y)\bigr). \label{eq:varadhan} \end{equation}\] Eq. [eq:varadhan] should be read as a model-family assumption (the trained attention kernel is the heat kernel at scale \(t\)), not as a universally proved theorem. Where it holds, attention gradients become a geodesic-direction oracle and OTT acquires a constructive starting velocity \(v_0\) for Eq. [eq:geodesic].

From the Fisher metric to the implemented Jacobi propagator

The runtime does not work directly with the Fisher metric tensor; it works with its Magnus-3 Jacobi propagator \(\Phi(\lambda)\), the linear map that advances a deviation vector along a geodesic. The chain that links the two is explicit and worth writing in one place. Given the Fisher metric \(g_{ij}(\theta)=\E_{x\sim p_\theta}\!\bigl[\partial_i\log p_\theta(x)\, \partial_j\log p_\theta(x)\bigr]\), the Levi-Civita connection symbols are \[\begin{equation} \Gamma^k_{ij}\;=\;\tfrac{1}{2}g^{kl}\!\left(\partial_i g_{jl}+\partial_j g_{il}-\partial_l g_{ij}\right), \label{eq:christoffel} \end{equation}\] and the Riemann curvature tensor along a geodesic \(\gamma(\lambda)\) is \(R^k{}_{lij}=\partial_i\Gamma^k_{jl}-\partial_j\Gamma^k_{il} +\Gamma^k_{im}\Gamma^m_{jl}-\Gamma^k_{jm}\Gamma^m_{il}\). A deviation field \(J(\lambda)\) between neighbouring geodesics satisfies the Jacobi equation \[\begin{equation} \frac{D^2 J^k}{d\lambda^2}\;+\;R^k{}_{ijl}\,\dot\gamma^i\,J^j\,\dot\gamma^l\;=\;0. \label{eq:jacobi} \end{equation}\] Writing \(\mathbf{J}=(J,\dot J)\in\R^{2k}\) and assembling the curvature-and-connection block into the \(2k\times2k\) generator \(A(\lambda)\), Eq. [eq:jacobi] becomes \(\dot{\mathbf{J}}=A(\lambda)\mathbf{J}\), whose flow is the Magnus expansion (Magnus 1954) \[\begin{equation} \Phi(\lambda)\;=\;\exp\!\Bigl(\Omega_1+\Omega_2+\Omega_3+\cdots\Bigr),\qquad \Omega_1=\!\int_0^\lambda\!A(s)\,ds,\;\; \Omega_2=\tfrac{1}{2}\!\int_0^\lambda\!\!\int_0^{s_1}\![A(s_1),A(s_2)]\,ds_2\,ds_1, \label{eq:magnus} \end{equation}\] and \(\Omega_3\) is the corresponding nested-commutator integral. The runtime truncates after \(\Omega_3\) (hence "Magnus-3") and integrates \(A\) along waypoints sampled from the geodesic. In one line: the implemented map is \[\begin{equation} \boxed{\;\;g_{ij}(\theta)\;\xrightarrow{\eqref{eq:christoffel}}\;\Gamma^k_{ij}\;\xrightarrow{\eqref{eq:jacobi}}\;A(\lambda)\;\xrightarrow{\eqref{eq:magnus},\,\text{trunc.\,3}}\;\Phi(\lambda).\;\;} \label{eq:fisher-to-phi} \end{equation}\]

The implemented Fisher-to-Jacobi chain. The runtime never materialises the full curvature tensor; instead it samples waypoints, constructs the \(2k\!\times\!2k\) Jacobi generator \(A(\lambda)\) at each waypoint, integrates the Magnus expansion to third order, and stores the resulting propagator \(\Phi(\lambda)\) in the GTC record. Eq.[eq:gtc-correction] is the first-order Taylor expansion that consumes \(\Phi(\lambda)\) at inference time.

Eq. [eq:gtc-correction] is then a first-order Taylor expansion of the flow with \(\Phi(\lambda)\) in the linear coefficient slot, which is exactly what the cache stores. The empirical anchor benchmarks measure the inverse property: that the cached \(\Phi(\lambda)\) correctly predicts deviation vectors on real LM activation clouds, validating the truncation.

Geodesic Trajectory Caching

Each completed inference produces a full geodesic trajectory, not just a terminal answer. A GTC record stores \[\begin{equation*} \mathcal{R} = \bigl(x_0,\,\dot x_0,\,\{x(\lambda_i)\}_{i=1}^M,\,\Phi(\lambda),\,\hat\rho,\,\ell_T\bigr) \end{equation*}\] i.e. the embedding, contextual velocity, waypoint sequence, Magnus-3 Jacobi propagator, an injectivity-radius estimate, and the terminal logits. New queries \(x_0+\delta q\) within the validity radius are served by a single \(\mathcal{O}(k^2)\) matrix--vector product: \[\begin{equation} x^\mu(\lambda) = \bar x^\mu(\lambda) \;+\; \Phi(\lambda)\,\delta q \;+\; \mathcal{O}\bigl(\|\delta q\|^2\bigr). \label{eq:gtc-correction} \end{equation}\]

Resonance.

The Jacobi equation is linear, so a batch of \(B\) correlated queries within the validity radius collapses into a single \(k\times k\times B\) matmul. Throughput therefore rises rather than falls under load, the resonance property. Sec. 9 measures it.

Library convergence.

On a synthetic mixture-of-Gaussians workload the library converges to a 92% hit rate while storing 7.8% of seen queries; on real LM clouds (Sec. 8) the hit-rate at 25%-fraction cache is 90.4--91.5%, scale-invariant.

Geodesic Trajectory Caching (GTC) at runtime. A new query is checked against the validity radius \(\hat\rho\) of the nearest cached record; on hit, the prediction is a single \(\mathcal{O}(k^2)\) matrix-vector product (Eq.[eq:gtc-correction]); on miss, the full forward pass materialises a new trajectory which is inserted back into the library. A batch of correlated near-hits collapses into a single \(k\!\times\!k\!\times\!B\) matmul, the resonance property measured in §9.

Connection to Block Attention Residuals

Block AttnRes (Kimi Team 2026) replaces fixed PreNorm residual accumulation with a softmax over learned pseudo-queries against block summaries \(b_n\). Under the OTT lens, the \(b_n\) are waypoints on the geodesic and the AttnRes weights \(\alpha_{n\to\ell}\) select a convex combination of those waypoints, re-anchoring the hidden state to \(\mathcal{M}_\theta\) and mitigating the \(\mathcal{O}(\sqrt L)\) magnitude inflation that PreNorm produces (Xiao et al. 2023). The interpretation is testable: AttnRes weights should concentrate on the block whose cached representation is geodesically nearest to the current state. The companion runtime paper (HyperTensor Project 2025b) reports an independent C implementation of AttnRes (\(\mathtt{-{}-attnres}\)) and its empirical interaction with compression.

Formal addendum (conditional)

This section sharpens the mathematical structure under stronger, explicit assumptions. It is a conditional formalisation: if the assumptions are accepted for a model family, then the corresponding statements follow.

A. Diffeomorphism \(\phi\).

A separation of base assumptions (Euclidean representation base, LayerNorm quotient structure, star-shaped head-chart preimages, residual flow generated by a smooth Morse-like potential satisfying a Palais--Smale-style compactness condition, and a smooth conformal softmax factor inside each chart) yields a candidate global diffeomorphism \(\phi\) from the trained activation cloud to a model manifold. This is a deployment-scoped construction; universal closure across all transformer families is open.

B. Diffusion--attention chain.

Eq. [eq:varadhan] is the endpoint of the implication chain (i) KL/natural-gradient training on a Fisher manifold \(\Rightarrow\) entropy-gradient flow approximation; (ii) embedding limit \(\Rightarrow\) Laplace--Beltrami diffusion form; (iii) heat-kernel identification \(\Rightarrow\) Varadhan asymptotic.

C. Jacobi propagation via JVP.

For smooth geodesic flow, applying \(\Phi(\lambda)\) to a perturbation is exactly a Jacobian--vector product (Pearlmutter 1994), so JVP gives linear-cost directional propagation without materialising a full Jacobian. With effective curvature rank \(r\ll\dim \mathcal{M}_\theta\), practical cost is restricted to an \(r\)-dimensional subspace; Sec. 10 measures \(r=5\) exact.

D. HJB-regularised AttnRes/GTC objective.

See 16 for a full statement; cross-referenced here for completeness.

E. Ricci-spectral \(\rho\) estimator.

A practical injectivity-radius estimator family, \[\begin{equation} \hat\rho(q) = C\cdot\frac{\pi}{\sqrt{\lambda_{\max}(\mathrm{Cov}(\nabla_x A))}}\cdot \frac{1}{\sigma_{\max}(A)}, \end{equation}\] combines attention-spectrum curvature proxies with a Klingenberg-style lower-bound form. Treated here as an operational template with model-family calibration constants, not a universal equality.

Open problems

The five blockers from the original draft are no longer all in the same category.

  1. Diffeomorphism \(\phi\). Open as a universal construction; deployment-scoped resolved for the OTT manifold family via certificate-backed inherited-structure arguments.

  2. Initial velocity \(v_0\). Universal closed-form open; the runtime ships a curvature-guided endpoint-direction prior with a Christoffel-based correction. The deployment blocker is reduced to calibration of an implemented surrogate.

  3. Jacobi propagator construction cost. No longer a live blocker: cost is paid at library-construction time and amortised by the compressed record store and exact rank-5 truncation (Sec. 10), and the measured batch resonance (Sec. 9) exceeds the analytic estimates by \(4\!-\!14\times\).

  4. AttnRes + GTC joint training. Open as a training problem. Inference-time correction is measured: single-anchor Jacobi transport is promising, simplex blending underperforms.

  5. Injectivity radius. Cheapest universal estimator open; operational closure exists, per-record \(\hat\rho\) and a measured validity-radius sweep keep Jacobi error \(<0.1\%\) to the tested threshold.

Knowledge-injection curvature warp: a measured negative

A related theoretical proposal, inject a knowledge cluster \(c\) into the manifold by Gaussian-warping the metric \(g'(x){=}g(x){-}\alpha\bigl(1{-}\exp(-|x{-}c|^2/2\sigma^2)\bigr)ww^\top\) so the geodesic toward \(c\) contracts, was reduced to a falsifiable sweep on SmolLM2-135M. Across 32 configurations of \((\alpha,\sigma,\delta\ell)\) with the success criterion "\(\geq 50\%\) reduction in target-endpoint geodesic error AND \(\leq 5\%\) spillover at non-target points", 0/32 pass; the best configuration achieves a 16 % target reduction at \(\alpha{=}0.99\) but its non-target spillover diverges (NaN at high \(\alpha\)).

Cross-model replication (2026-04-29).

The single-model result was replicated across three architectures (SmolLM2-135M, Gemma-4-e2b, Phi-3.5-mini) and two protocol variants: (v1) the original Christoffel-Gaussian warp (\(\alpha{=}0.7\), \(\sigma{=}1.2\), \(T{=}16\), \(\delta\ell{=}0.1\)), and (v2) a covariant Riemannian-exponential push of magnitude \(\alpha{=}0.35\) in a radius-\(1.2\) ball, parallel-transported by the cached Christoffel symbols. Twelve configurations total (3 models \(\times\) 2 intrinsic dimensions \(n\in\{6,8\}\) \(\times\) 2 protocols), 146 s wall on a single CPU core. 0/12 satisfy both criteria. The best v1 case (Gemma-4-e2b \(n{=}6\)) achieves a 39.4 % target-error reduction but 60 % p95 spillover; v2 produces negative target improvements on Phi-3.5-mini (target error grows by 9--24 %) and only one configuration meets the spillover criterion alone (Gemma-4-e2b \(n{=}8\) v2: \(-1.8\) % target improvement, 14 % p95 spillover). Phi-3.5-mini v1 \(n{=}6\) produces NaN post-error: the warp drives the geodesic solver out of its convergence basin. The cross-model replication strengthens rather than narrows the negative conclusion: neither protocol generalises and v2 is consistently worse than the trivial baseline on at least one architecture.

The per-configuration breakdown is given in 1. The success criterion (target reduction \(\geq 50\) % and spillover p95 \(\leq 5\) %) is failed in every cell. Only one cell (Gemma-4-e2b \(n{=}6\) v1) breaks \(30\) % target reduction, and that cell has the second-worst spillover in the table. Six of twelve cells produce negative target improvement, meaning the warp moved the geodesic farther from the target than the unwarped baseline; four of these six are v2, the variant that was designed to fix the v1 spillover problem.

Per-configuration curvature-warp results, \(3\) models \(\times\) \(2\) intrinsic dimensions \(\times\) \(2\) protocols, \(146\) s wall on a single CPU core. Target reduction is the relative geodesic-error drop at the injected fact (positive is better); spillover p95 is the worst-case relative geodesic-error growth at non-target points (lower is better). Pass criterion: target \(\geq 50\) % AND spillover p95 \(\leq 5\) %. Cells with NaN are runs where the warp pushed the geodesic solver out of its convergence basin. Raw: docs/figures/curvature_warp/cross_model_summary.json.
Model \(n_\text{int}\) Variant Target red. (%) Spillover p95 (%) Pass?
Gemma-4-e2b 6 v1 christoffel \(39.4\) \(60.3\) no
Gemma-4-e2b 6 v2 covariant \(-57.4\) \(36.9\) no
Gemma-4-e2b 8 v1 christoffel \(5.0\) \(40.8\) no
Gemma-4-e2b 8 v2 covariant \(-1.8\) \(14.4\) no
Phi-3.5-mini 6 v1 christoffel NaN NaN no (NaN)
Phi-3.5-mini 6 v2 covariant \(-9.4\) \(57.8\) no
Phi-3.5-mini 8 v1 christoffel \(-23.9\) NaN no (NaN)
Phi-3.5-mini 8 v2 covariant \(-23.7\) \(157.7\) no
SmolLM2-135M 6 v1 christoffel \(21.6\) NaN no (NaN)
SmolLM2-135M 6 v2 covariant \(-12.9\) \(41.4\) no
SmolLM2-135M 8 v1 christoffel \(-20.9\) \(104.6\) no
SmolLM2-135M 8 v2 covariant \(-22.1\) \(97.7\) no

The mechanism is consistent with the empirical finding of 16 (Empirical feasibility stub): the manifolds exported from frozen pretrained networks are nearly flat in the intrinsic-coordinate basis (kinetic-to-curvature ratio \(> 10^{17}\) in squared L2), so a Gaussian metric perturbation is either too weak to redirect the geodesic or strong enough to redirect but at the cost of a global metric perturbation that destroys the surrounding manifold. We retain this as a documented negative result; the positive form of the problem (curvature warp via learnable \(\phi\) rather than hand-crafted \(g'\)) is the joint-training premise of 16, and the cross-model evidence here is what makes that future training premise non-trivial: any future construction must demonstrably outperform 0/12 hand-crafted variants on the same three-model benchmark before being claimed as a positive result. Raw sweeps: docs/figures/curvature_warp/smollm2-135m_sweep.json (single-model 32-config) and docs/figures/curvature_warp/cross_model_summary.json (cross-model 12-config); status note: docs/figures/curvature_warp/STATUS.md; findings note: docs/figures/findings/FIVE_FINDINGS.md §5.

Part II, Empirical anchor

Setup: fitting the manifold from Phase-1 exports

The C runtime emits one global Christoffel tensor and a per-point metric diagonal in axgeo_christoffel_t; that representation is too coarse for the GTC contract. We fit the manifold entirely in Python from Phase-1 clouds:

Module Role
scripts/gtc/manifold.py \(k\)-NN Mahalanobis metric, log-Euclidean RBF smoothing, finite-difference \(\Gamma^k_{ij}\)
scripts/gtc/geodesic.py RK4 integrator for \(\ddot x^k = -\Gamma^k_{ij}\dot x^i \dot x^j\)
scripts/gtc/jacobi.py Riemann tensor by FD of \(\Gamma\), Magnus-3 propagator \(\Phi(\lambda)\)
scripts/gtc/validity_radius.py \(\varepsilon\)-sweep, emits <case>_validity_radius.json
scripts/gtc/gtc_benchmark.py Coverage benchmark, emits <case>_coverage.json
scripts/gtc/record_store.py Compressed library + two-stage lookup

A sphere sanity at \(K=1, n=4\), 256 samples confirms the harness: validity error scales quadratically in \(\varepsilon\) exactly per the Jacobi bound (\(\varepsilon^\star(5\%)=0.05\), \(\varepsilon^\star(10\%)=0.10\), \(\varepsilon^\star(20\%)=0.20\)). The harness is validated; the LM numbers below are not artefacts.

Online vs. offline manifold density (Llama-3.1-8B).

A late-stage end-to-end run of the same Phase-1 pipeline on Llama-3.1-8B-Instruct (axiom_beta_report.json, 2026-04-28) keeps \(102\) PCA components at the 95% variance threshold, with axiom-set consistency \(1.000\) and Fisher-blend \(0.20\) on \(128\) metric sample points and \(400\) geodesic steps. Sampling the runtime trajectory live (cloud_density_summary.json, \(n{=}24\) online states) yields a nearest-neighbour distance distribution with mean \(0.132\), median \(0.093\), and 90th percentile \(0.348\) in the cosine metric. Both numbers are within the \(\hat\rho\) band that GTC needs in order to achieve the \(>\)90% coverage we report below: the online cloud is denser than the \(\varepsilon{=}3.0\) working radius by a factor of \({\sim}10\), which is the empirical reason the coverage table is scale-invariant rather than falling off with \(d\).

Manifold parameters at frontier scale.

A cold Phase-1 run on Llama-3.1-8B-Instruct Q4_K_M (256 hidden-state samples, 1024-token embedding probe) measures TwoNN intrinsic dimension \(k_\text{int}{=}19.96\) and 132 PCA components for 95.1% variance retention (Phase 1 wallclock 63.6 s; Phase 3 Fisher-blended curvature on 17,408 block-sampled tokens, 31.0 s; Phase 4 axiom set 40 axioms, consistency \(0.9499\), 26 oracle calls). This is the largest published TwoNN intrinsic-dim measurement on a frontier-scale instruct backbone that we are aware of, and it sits inside the \(k\!\approx\!30\)--\(50\) band predicted by prior work (Ansuini et al. 2019; Pope et al. 2021) at the lower edge, consistent with the smoothness hypothesis: as models scale, \(k_\text{int}/d\) collapses (\(20/4096{\approx}0.005\) for Llama-8B vs. \({\approx}0.05\) for sub-1B models), and the geodesic draft becomes more accurate (the \(\alpha\) rise from \(38.5\%\) on SmolLM2-135M to \(46.9\%\) on Llama-8B documented in Paper C (HyperTensor Authors 2026) is the operational signature of this collapse).

Mistral-Nemo-12B replication.

A second frontier-scale Phase-1 run on Mistral-Nemo-Instruct-2407 Q4_K_M (40 layers, \(d{=}5120\), 13.67 B parameters, 256 hidden-state samples, 1024-token probe) yields TwoNN \(k_\text{int}{=}5.36\) and 156 PCA components for 95.0% variance (Phase 1 wallclock 6.1 s; Phase 3 Fisher-blended curvature on 20,480 block-sampled tokens, 63.6 s; Phase 4 10 axioms, consistency \(0.9407\)). The PCA-rank reading is consistent with Llama-8B (132 components, \(d{=}4096\)); the very low TwoNN reading at \(n{=}256\) samples is plausibly an undersampling artefact for a wider model (\(d{=}5120\)) and we report it here as preliminary, the matched sample budget for an apples-to-apples comparison with the Llama-8B 512-sample run is the next manifold-survey item. Two independent 13 B-class instruct backbones now agree that the variance-95% projection rank sits around \(130\text{--}160\), three orders of magnitude below \(d^2\), which is the operationally relevant number for OTT runtime sizing.

Coverage scaling across three models

Coverage is the fraction of held-out activation cloud points within \(g\)-norm distance \(\varepsilon\) of the nearest cached point. All three measurements at \(\varepsilon=3.0\), \(n_{\mathrm{intrinsic}}=8\), \(n_{\mathrm{repeats}}=16\).

GTC cache coverage on three open-weight models. The 25%-fraction column is scale-invariant within \(\pm 0.5\%\) across a \(33\times\) parameter range.
Model Params \(k=6\) (10%) \(k=16\) (25%) \(k=32\) (50%) \(k=48\) (75%)
SmolLM2-135M 135 M 58.6% 91.0% 99.8% 100.0%
Phi-3.5-mini 3.8 B 55.5% 90.4% 98.2% 100.0%
Gemma-4-E2B 4.5 B 58.7% 91.5% 99.6% 100.0%

Finding The scale-invariance prediction (the "flag flip" claim) holds within \(\pm 0.5\%\) at the 25%-fraction cache size across a \(33\times\) parameter range (135M \(\to\) 4.5B). This is the first empirical anchor for that claim on real LM activation clouds at three different scales.

Batch Jacobi resonance

The Jacobi propagator \(\Phi(\lambda)\) is linear in the perturbation, so a batch of \(B\) correlated queries can be corrected in a single matmul.

Batch Jacobi resonance on the SmolLM2-135M cloud. Reconstruction error pinned to the float64 roundoff floor at every batch size.
Batch \(B\) Sequential (ms) Batched (ms) Speedup μs/query rel. error
1 0.015 0.001 14.6\(\times\) 1.000 0
10 0.411 0.004 97.9\(\times\) 0.420 \(1.1\!\times\!10^{-16}\)
100 0.167 0.006 27.4\(\times\) 0.061 \(1.2\!\times\!10^{-16}\)
1 000 1.143 0.026 44.5\(\times\) 0.026 \(1.2\!\times\!10^{-16}\)
10 000 11.100 0.185 60.0\(\times\) 0.0185 \(1.2\!\times\!10^{-16}\)

The analytic estimates for the \(B\in\{10,100,1000\}\) regimes were \(2.7\times/12.5\times/7.0\times\); the numpy-BLAS realisation exceeds them by \(4\)--\(14\times\) because the analytic estimate did not account for cache and SIMD effects. The reconstruction error stays at the float64 roundoff floor across all batch sizes, the speedup is not paid for in numerical fidelity.

Compressed record store

A naive trajectory record is hundreds of KB. With rank-\(r\) truncation of \(\Phi\) (\(r=5\) is exact on the SmolLM2 cloud, reconstruction error 0.0) and waypoint subsampling, persisted records reach 5.96 KB, roughly an order of magnitude under the 50--80 KB design target.

Compressed record store on the SmolLM2 cloud. The two-stage lookup is Euclidean ANN \(\to\) \(g\)-norm refinement.
Quantity Value Design target
Records persisted 24 ,
Total .npz size 143.0 KB ,
Per-record size 5.96 KB 50--80 KB
Rank-5 \(\Phi\) reconstruction error 0.0 "rank \(\approx 5\) is sufficient"
Build wall-clock (24 records, \(k=8\)) 6.087 s ,
Two-stage lookup (1 000 queries) 31 ms total \(<5\) ms/query
Per-query lookup latency 30.9 μs \(<5\) ms (\(\sim\!160\times\) under)
Density caveat.

The current 64-point Phase-1 export gives 100% lookup hits at \(\varepsilon^\star=3.0\) but 0% within the Jacobi validity radius \(\hat\rho=0.4\) on a held-out cloud. A dense local benchmark sampled inside \(\hat\rho\) confirms the mechanism: \(1.43\!\times\!10^{-7}\) mean relative error and \(158\times\) speedup over a full geodesic step. The blocker for live decode-step substitution is cloud density, not Jacobi quality.

OTT runtime anchor

The C host runtime ships an end-to-end OTT pipeline: geometry-cache load, OneDecode bake, speculative decode against a verifier, and a readiness gate that emits ott_readiness_report.json. The pipeline reaches status=geodesic_ready:

Quantity Value Notes
OTT readiness status geodesic_ready ready/hybrid_ready/share/consistency all=1.0
Acceptance rate \(\alpha\) 38.5% 5 geo-accepted / 13 generated
End-to-end throughput 76.5 tok/s 13 tokens in 170 ms; greedy baseline \(\sim\!50\) tok/s
Empirical speedup \(1.53\times\) Within Paper C closed-form prediction at \(\alpha=0.385,\gamma=4\)
OD draft hits 5 OneDecode table hits
Final adaptive batch 4 Stable; did not collapse

The instruct-greedy-EOS pathology

Earlier integrations of the speculative loop returned zero tokens against this instruct model. The cause: the verifier's argmax at position 0 (and several subsequent positions) is the EOS token. A standard speculative loop (Leviathan et al. 2023; Chen et al. 2023) sees an EOS draft and exits. Existing literature (including Medusa (Cai et al. 2024) and EAGLE (Li et al. 2024)) does not document this case because it primarily targets base (non-instruct) backbones where the greedy distribution does not degenerate into EOS.

The fix shipped here is a small primitive we call logit-excluding top-1:

// runtime/nn/llm.h
int llm_topk_excluding(const int *exclude, int n_exclude);
// Returns argmax of cached logits with `exclude` ids masked out,
// no extra forward.

plus a min-response guard \(N_{\min}=4\) that enables the bypass only at positions \(i<N_{\min}\). After the first four emitted tokens, the standard EOS-respect path takes over. The \(\S\)11 numbers are conditional on the fix being in place; removing it returns the loop to 0 tok/s.

Geometry-cache consistency-equivalence

The OTT readiness gate previously failed when geometry was loaded from the persistent cache, because the calibration phase that writes consistency_score is skipped on cache hit and the score defaults to 0. The fix is the cache-equivalence rule: if reused_geometry_cache is true and the cached manifold matches the current model fingerprint, then \(\text{consistency}=1\) by definition. Practically a one-line guard; theoretically the statement that calibration is invariant under fixed-manifold reuse.

Gap analysis vs. the OTT claim list

OTT claim Status Anchor
Christoffel field \(\Gamma\) from \(g\) measured manifold.py
Geodesic ODE integrator measured geodesic.py
Riemann tensor + Jacobi propagator measured jacobi.py
Sphere sanity, quadratic \(\varepsilon\) scaling exact §7
Hit rate \(\ge 65\%\) on clustered distribution 90.4--91.5% §8
Library size sublinear \(k{=}16\) covers 91% of 64-pt cloud §8
Batch matmul \(\equiv\) sequential \(1.2\!\times\!10^{-16}\) rec. err. §9
Batch \(B{=}10/100/1000\) speedups \(97\times,27\times,44\times\) §9
Two-stage FAISS+geodesic lookup 30.9 μs/q §10
Compressed record store (\(\sim\!50\)--\(80\) KB target) 5.96 KB at \(k{=}8\) §10
Scaling: SmolLM2 \(\to\) Phi-3.5-mini "flag flip" \(\pm 0.5\%\) §8
Validity / injectivity radius \(\hat\rho\) scaling \(<0.1\%\) err to \(\varepsilon{=}5.0\) validity_radius.json
OTT locality of curvature warp ratio \(7\!\times\!10^{11}\), decays to 0 at \(20\sigma\) implicit
OTT runtime end-to-end partial: \(\alpha=0.385\), \(1.53\times\) §11
Knowledge-injection curvature warp negative: 0/32 single-model + 0/12 cross-model curvature_warp/
AttnRes block-summary integration prototype: block-end Jacobi err 1.29% attnres_integration.json
Diffeomorphism \(\phi\) construction resolved for OTT family via certificates deployment
Initial velocity \(v_0\) universal open; deployable surrogate exists axiom_beta.c

Reading: 12 measured pass, 1 measured fail (curvature-warp knowledge-injection), 2 measured partial (live-decode replacement, AttnRes integration), 1 universally open / deployment-resolved (\(\phi\)), 1 universally open / deployable surrogate (\(v_0\)). The OTT programme is no longer "framework only": it is a framework with a verified core and a short, named list of open items.

What is genuinely new here

  1. Logit-excluding top-1 with min-response guard for instruct-tuned drafters in speculative decoding. Closes the instruct-greedy-EOS failure mode without forward-pass overhead. We are not aware of a published treatment in the existing speculative-decoding literature (Leviathan et al. 2023; Chen et al. 2023; Cai et al. 2024; Li et al. 2024).

  2. Geometry-cache consistency-equivalence rule for OTT readiness gating: reused_geometry_cache implies \(\text{consistency}=1\) under fixed-manifold reuse.

  3. Empirical scale-invariance of cache coverage across a \(33\times\) parameter range at fixed sample budget, the first measurement of this prediction on real LM clouds at three scales.

The other components, geodesic ODE, Jacobi propagator, GP compression, OneDecode, the speculative-decoding rejection rule, are inherited from prior work and are explicitly cited as such.

Reproduction

# Phase-1 cloud export (emits docs/figures/gtc/<case>.json)
python scripts/gtc/manifold.py    --case smollm2-135m
python scripts/gtc/jacobi.py      --case smollm2-135m
python scripts/gtc/gtc_benchmark.py --case smollm2-135m
python scripts/gtc/record_store.py  --case smollm2-135m

# OTT runtime anchor (emits ott_readiness_report.json)
./geod --ott --gguf SmolLM2-135M-Instruct-Q8_0.gguf \
       --prompt "Hello" --gamma 4 --max-tokens 13

Limitations

Future Work: HJB-Regularised Joint Training

The OTT runtime, as shipped, is an inference-time geometric overlay: the base weights are frozen and the manifold structure is read off the pre-trained network. A natural extension is to make the network and the manifold co-trained, by adding a Hamilton--Jacobi--Bellman (HJB) style regulariser that forces residual-stream trajectories to be Jacobi-consistent under the runtime's own curvature estimator.

Let \(s_\ell\in\mathbb{R}^{d}\) denote the block summary at layer \(\ell\in\{0,\dots,L\}\) (the residual-stream state at the block boundary), let \(\Delta s_\ell := s_{\ell+1}-s_\ell\) be the discrete velocity, and let \(\Delta^2 s_\ell := s_{\ell+1}-2s_\ell+s_{\ell-1}\) be the discrete acceleration. Let \(\hat R(s)\in\mathbb{R}^{d\times d}\) be the runtime's local sectional-curvature operator (the same \(\hat R\) used by the Jacobi propagator \(\Phi(\lambda)\) of 2.2). Then the finite-difference Jacobi residual \[\begin{equation} \mathcal{J}_\ell(s) \;:=\; \Delta^2 s_\ell \;+\; \hat R(s_\ell)\,\Delta s_\ell \label{eq:jacobi-residual} \end{equation}\] is the discrete Bellman residual of an HJB-style optimality condition on the layer-indexed trajectory \(\ell\mapsto s_\ell\). We propose the joint objective \[\begin{equation} \boxed{\; L_{\mathrm{SHF}}(\theta) \;=\; L_{\mathrm{task}}(\theta) \;+\; \lambda\,\sum_{\ell=1}^{L-1} \bigl\|\,\mathcal{J}_\ell(s_\ell;\theta)\,\bigr\|_2^{\,2} \;} \label{eq:hjb-loss} \end{equation}\] with \(\lambda\in[10^{-3},10^{-1}]\), computed on the same block summaries used by the GTC runtime so that the regulariser is architecturally free at the storage layer. We refer to this as the SHF (Spectral Hamiltonian Flow) loss.

Why this is the right form.

[eq:hjb-loss] is a finite-difference enforcement of the geodesic equation \(\nabla_{\dot\gamma}\dot\gamma=0\) in the linearised regime where \(\hat R\) controls the second variation; equivalently, it is the discrete-time HJB residual of a value function whose generator is the Fisher--Rao Laplacian of 2.2. Models trained with \(L_{\mathrm{SHF}}\) should, by construction, admit sharper Magnus-3 propagators, larger admissible step \(\lambda\), and lower OTT verifier disagreement under speculative decoding.

Status.

This loss is not implemented in the reference runtime that produces the measurements reported in this paper; the runtime is inference-only and the base network is frozen. We state [eq:hjb-loss] here so that any future training run that uses this exact penalty is on the public record as derived from the OTT/GTC framework.

Empirical feasibility stub (2026-04-29).

While \(L_\text{SHF}\) is not yet a training objective, we can already measure \(\bigl\|\mathcal{J}_\ell(s_\ell)\bigr\|_2^2\) on trajectories through each model's fitted Phase-3 manifold to fix the order of magnitude that any future trainer must reach. We compute [eq:jacobi-residual] along three trajectory classes (intrinsic dim \(n{=}8\), \(L{=}32\) steps, 12 trajectories per class): (i) baked geodesics integrated by Magnus-3 forward Euler (\(x_{t+1}{=}x_t{+}\delta\,v_t,\; v_{t+1}{=}v_t{-}\delta\,\Gamma^k_{ij}v^iv^j\)), the SHF floor; (ii) \(K{=}4\)-nearest-neighbour walks through the Phase-3 cloud, the manifold-conformant regime; (iii) random straight-line interpolations between two cloud endpoints the off-manifold ceiling.

Kinetic (\(\Delta^2 s_\ell\)) and curvature (\(\hat R(s_\ell)\Delta s_\ell\)) summands are reported separately. On all three exported manifolds (smollm2-135m, gemma-4-e2b, phi-3.5-mini) the kinetic term dominates the curvature term by 17--32 orders of magnitude in squared L2; the off-manifold straight-line baseline produces residuals of order \(10^{-30}\) in the units of the unit-scaled manifold (it is over-determined by definition, a straight line in cloud-coordinates is its own discrete geodesic against an isotropic metric), the NN-walk class produces \(7\)--\(93\) in the same units, and the baked geodesic floor sits between \(10^{-34}\) and \(10^{-3}\). The per-residual cost at \(n{=}8\) is \(\approx\!5.6\!\times\!10^{5}\) FLOPs, dominated by the \(O(n^4)\) Riemann-tensor evaluation, which the runtime already caches per sample point in service of the Jacobi propagator. Two conclusions follow for the SHF training prescription: (a) the loss term is empirically tractable (sub-second compute for one \(L{=}32\) trajectory at \(n{=}8\)), and (b) on manifolds exported from frozen pretrained networks the curvature contribution is negligible, so \(L_\text{SHF}\) in its current form would act primarily as a second-difference smoothness penalty on block summaries unless the trainer co-evolves a manifold of demonstrably non-zero curvature (which is the joint-training premise of this section and a falsifiable target). Raw data: docs/figures/paper-d/hjb_feasibility/hjb_residual_magnitudes.json; script:
scripts/hjb_feasibility.py.

Ansuini, Alessio, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. 2019. "Intrinsic Dimension of Data Representations in Deep Neural Networks." NeurIPS.
Cai, Tianle, Yuhong Li, Zhengyang Geng, et al. 2024. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. https://arxiv.org/abs/2401.10774.
Chen, Charlie, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating Large Language Model Decoding with Speculative Sampling. https://arxiv.org/abs/2302.01318.
Elhage, Nelson, Tristan Hume, Catherine Olsson, Nicholas Schiefer, et al. 2022. Toy Models of Superposition. Anthropic Transformer Circuits Thread.
Golub, Gene H., and Charles F. Van Loan. 1996. Matrix Computations. 3rd ed. Johns Hopkins University Press.
Grattafiori, Aaron et al. 2024. The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783.
HyperTensor Authors. 2026. "Geodesic Speculative Decoding (Companion Paper)." Unpublished manuscript.
HyperTensor Project. 2025a. Geodesic Projection: A Multi-Slot Compression Pipeline with Adaptive Extensions.
HyperTensor Project. 2025b. Geodesic Speculative Decoding Under Compression.
Kimi Team. 2026. Block Attention Residuals for Stable Deep Transformers. https://arxiv.org/abs/2603.15031.
Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2023. "Fast Inference from Transformers via Speculative Decoding." ICML.
Li, Yuhui, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. https://arxiv.org/abs/2401.15077.
Magnus, Wilhelm. 1954. "On the Exponential Solution of Differential Equations for a Linear Operator." Communications on Pure and Applied Mathematics 7 (4): 649--73.
Nakkiran, Preetum, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2021. Deep Double Descent: Where Bigger Models and More Data Hurt. https://arxiv.org/abs/1912.02292.
Pearlmutter, Barak A. 1994. "Fast Exact Multiplication by the Hessian." Neural Computation 6 (1): 147--60.
Pope, Phillip, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. 2021. "The Intrinsic Dimension of Images and Its Impact on Learning." ICLR.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NeurIPS).
Williams, Samuel, Andrew Waterman, and David Patterson. 2009. "Roofline: An Insightful Visual Performance Model for Multicore Architectures." Communications of the ACM 52 (4): 65--76.
Xiao, Guangxuan, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient Streaming Language Models with Attention Sinks. https://arxiv.org/abs/2309.17453.