Papers 1, 2, and 3 are empirical. They report measurements on real hardware running real LLMs. This paper is still primarily a theoretical framework, but it is no longer fair to describe it as simulation-only in every respect. The GTC components now have real-manifold anchors on exported LM activation clouds, and the diffeomorphism question is resolved for the current OTT deployment manifolds in this repository via inherited-structure certificates. What remains open is the universal construction and the full runtime deployment path. Where this paper claims a speedup figure (e.g. 4800× at $n{=}32{,}768$), that figure should still be read as conditional on solving the remaining deployment problems, not as a production benchmark claim.
Current status is narrower and stronger than the original Paper 4 wording: the map $\phi$ is still open as a universal transformer-scale construction, but for the measured OTT manifolds in this repository (SmolLM2, Phi-3.5-mini, Gemma-4-E2B) the deployment-scoped diffeomorphism requirement is treated as resolved by certificate-backed inherited structure. That is a practical closure for this repo's OTT regime, not a claim that the general mathematical problem is solved.
Abstract
We treat the trained latent space of a transformer as a Riemannian manifold $\mathcal{M}_\theta$ of intrinsic dimension $k \approx 30$--$50$, equipped with a Fisher-information metric. Under that view, inference is approximately the solution of the geodesic equation, with cost $\mathcal{O}(nk^2)$ rather than the standard $\mathcal{O}(n^2 d L)$. We then propose Geodesic Trajectory Caching (GTC): a self-improving library of stored geodesics, where new queries are served by a Jacobi-field linear correction against the nearest stored trajectory at cost $\mathcal{O}(k^2)$ per query. We connect this picture to Block Attention Residuals (AttnRes, Kimi Team 2026) by reading AttnRes weights as depth-wise geodesic-segment selection on $\mathcal{M}_\theta$. Since the original draft, parts of the proposal have acquired real-manifold anchors: cache coverage, batch Jacobi correction, compressed record storage, and an AttnRes-style block-summary correction prototype have all been measured on fitted LM manifolds. What is left open is not the local geometry itself but the full runtime deployment path and a universal construction of $\phi$.
"GRC" disambiguation
The earlier draft of this paper used "GRC" for Geodesic Resonance Caching. Paper 1 on this site uses "GRC" for Geodesic Runtime Compression, a different, implemented thing. To remove the collision, the trajectory-library idea is renamed in this paper to Geodesic Trajectory Caching (GTC). The "resonance" terminology was always slightly metaphorical; the substantive content is the trajectory library plus the Jacobi correction, which the new name reflects more accurately.
The manifold view
Let $\theta \in \mathbb{R}^P$ be the trained weights. Define
with metric approximated by the Fisher information matrix. Empirical intrinsic-dimension estimation (PCA, TwoNN) on activation spaces gives $k \approx 30$--$50$ for current LLMs despite $d \in \{4096, 8192\}$. Inference under this view solves the geodesic equation with cost $\mathcal{O}(nk^2)$.
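The TwoNN estimate referred to above can be illustrated in a few lines. The following is a minimal NumPy sketch, assuming the maximum-likelihood form of the TwoNN estimator (ratio of second to first nearest-neighbour distance); the function name `twonn_intrinsic_dimension` and the synthetic cloud are illustrative assumptions, not the repository's Phase-1 fitting code.

```python
import numpy as np

def twonn_intrinsic_dimension(points: np.ndarray) -> float:
    """Maximum-likelihood TwoNN estimate of intrinsic dimension.

    points: array of shape (N, d), one activation vector per row.
    """
    sq_norms = np.sum(points**2, axis=1)
    # Squared pairwise distances via the Gram identity; clip tiny negatives.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.maximum(d2, 0.0, out=d2)
    np.fill_diagonal(d2, np.inf)  # exclude self-distances
    # Two smallest squared distances per row (first and second neighbour).
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])
    keep = r1 > 0  # drop exact duplicates
    mu = r2[keep] / r1[keep]
    # Under the TwoNN model mu follows a Pareto law with exponent k,
    # so the maximum-likelihood estimate is N / sum(log mu).
    return float(keep.sum() / np.sum(np.log(mu)))

# Synthetic stand-in for an activation cloud: a 3-dimensional Gaussian
# embedded isometrically in 20 ambient dimensions (assumption for the demo).
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(20, 3)))
cloud = rng.normal(size=(1000, 3)) @ basis.T
k_hat = twonn_intrinsic_dimension(cloud)
```

On real activation clouds the estimate depends on sample size and on how the activations are subsampled, so it is best read alongside a PCA spectrum rather than on its own.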
This section will reproduce the OTT material from the original PDF, with citations to the linear-representation-probing literature that supports the low-dimensional structure claim.
Storing geodesics, correcting cheaply
Each completed inference produces a full geodesic trajectory, not just an answer. We propose storing each trajectory as a record (the embedding, contextual velocity, waypoint sequence, Jacobi propagator $J(\lambda) \in \mathbb{R}^{k \times k}$, an injectivity-radius estimate, and the terminal logits) in a library indexed by nearest-neighbour search. New queries within the validity radius are served by a single $\mathcal{O}(k^2)$ matrix-vector product:
The Jacobi equation is linear, so a batch of similar queries can be corrected in a single matrix multiplication: a "resonance" effect in which throughput rises rather than falls under load. The simulation suite verifies linearity (error $7.3 \times 10^{-13}$), superposition (error $1.6 \times 10^{-17}$), and exactness in flat regions (error $5.6 \times 10^{-17}$). On a synthetic mixture-of-Gaussians query distribution, the library converges to a 92% hit rate while storing only 7.8% of seen queries.
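The lookup-and-correct step can be sketched as follows. This is a minimal illustration under stated assumptions: the record holds a terminal point in chart coordinates rather than terminal logits, the nearest-neighbour search is brute force, and the names `GeodesicRecord`, `serve_batch`, and `nearest_record` are hypothetical rather than taken from the repository.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GeodesicRecord:
    embedding: np.ndarray  # (k,) query embedding the trajectory started from
    endpoint: np.ndarray   # (k,) terminal point of the stored geodesic
    jacobi: np.ndarray     # (k, k) Jacobi propagator J(lambda) at the endpoint
    radius: float          # validity-radius estimate rho for this record

def nearest_record(query: np.ndarray, library: list) -> GeodesicRecord:
    """Brute-force nearest neighbour over stored embeddings."""
    dists = [np.linalg.norm(query - rec.embedding) for rec in library]
    return library[int(np.argmin(dists))]

def serve_batch(queries: np.ndarray, record: GeodesicRecord):
    """Correct a batch of queries against one stored record.

    Returns (corrected_endpoints, hit_mask). Rows outside the validity
    radius are misses and are set to NaN so the caller can fall back
    to full inference for them.
    """
    deltas = queries - record.embedding  # (B, k) initial perturbations
    hit = np.linalg.norm(deltas, axis=1) <= record.radius
    # One matmul applies J to every perturbation in the batch.
    corrected = record.endpoint + deltas @ record.jacobi.T
    corrected[~hit] = np.nan
    return corrected, hit
```

Because the correction is linear in the perturbation, the batched result equals the row-by-row result; the batch form only changes how the work is scheduled.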
AttnRes as depth-wise geodesic segment selection
Block AttnRes (Kimi Team, arXiv:2603.15031) replaces fixed PreNorm residual accumulation with a softmax over learned pseudo-queries against block summaries. Under the manifold interpretation, block summaries $b_n$ are waypoints on the geodesic, and the AttnRes attention weights $\alpha_{n \to l}$ select a convex combination of those waypoints. This re-anchors the hidden state to $\mathcal{M}_\theta$, mitigating the $\mathcal{O}(\sqrt{L})$ magnitude inflation that PreNorm produces. Importantly, this connection is testable: AttnRes weights should concentrate on the block whose cached representation is geodesically nearest to the current state. We verify this in the simulation suite (8/8 trials).
The HyperTensor runtime ships an independent implementation of AttnRes (--attnres); Paper 3 reports its empirical interaction with compression. Theory and practice meet there.
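The convex-combination reading of AttnRes can be made concrete with a small sketch. It assumes the simplest possible form (dot-product scores between one pseudo-query and the block summaries, followed by a softmax); the normalisation details of the published AttnRes layer and of the HyperTensor implementation are not reproduced, and `block_residual_mix` is a hypothetical name.

```python
import numpy as np

def block_residual_mix(pseudo_query: np.ndarray, block_summaries: np.ndarray):
    """Softmax-weighted convex combination of block summaries.

    pseudo_query:    (d,) learned query for the current layer
    block_summaries: (n_blocks, d) one summary b_n per earlier block
    Returns (mixed_state, alpha) with alpha >= 0 and alpha.sum() == 1.
    """
    scores = block_summaries @ pseudo_query
    scores = scores - scores.max()  # subtract the max for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum()
    return alpha @ block_summaries, alpha

# Random summaries as a stand-in for real block outputs (demo assumption).
rng = np.random.default_rng(0)
summaries = rng.normal(size=(16, 64))
query = rng.normal(size=64)
mixed, alpha = block_residual_mix(query, summaries)

# A convex combination cannot exceed the largest summary norm, whereas a
# plain running sum of the same summaries grows with the number of blocks.
largest_summary_norm = np.linalg.norm(summaries, axis=1).max()
mixed_norm = np.linalg.norm(mixed)
summed_norm = np.linalg.norm(summaries.sum(axis=0))
```

The norm bound follows from convexity alone, which is the sense in which the mixture stays anchored to the region spanned by the waypoints.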
Assumption-explicit theorem templates
This section sharpens the mathematical structure of Paper 4 using stronger, explicit assumptions. It should be read as a conditional formalization roadmap: if these assumptions are accepted for a model family, then the corresponding theorem statements follow. It is not a claim that all assumptions are already verified for all transformers.
A. Diffeomorphism $\phi$ with explicit assumptions
A stronger statement can be made by separating the base assumptions from the conclusion. The assumptions are: (i) a Euclidean representation base space; (ii) LayerNorm quotient structure; (iii) star-shaped head-chart preimages; (iv) residual-flow dynamics generated by a smooth Morse-like potential satisfying a compactness condition (Palais–Smale style); and (v) a smooth conformal Softmax factor inside each chart. Under this package, the time-$T$ flow map is formulated as a global diffeomorphism candidate, and the chart transitions are smooth by composition.
This aligns with the deployment-scoped certificate story already used in the repository while making the logical dependence explicit: universal closure still requires these assumptions to hold beyond the measured OTT manifold family.
B. Diffusion–attention lemma chain and log map
The paper can be structured as a three-step implication chain: (1) KL/natural-gradient training on a Fisher manifold gives an entropy-gradient flow approximation, (2) in a suitable embedding limit this induces a Laplace–Beltrami diffusion form, and (3) if the trained attention kernel is identified with the corresponding heat kernel at scale $t$, then Varadhan's asymptotic relation yields
The key quality improvement is epistemic hygiene: the heat-kernel identification is treated as a model-family assumption, not silently as a theorem.
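For reference, the standard statement of Varadhan's formula is given here under an assumed normalisation: $K_t$ denotes the heat kernel of $\partial_t u = \Delta_g u$ on $(\mathcal{M}_\theta, g)$ and $d_g$ the geodesic distance. With generator $\tfrac{1}{2}\Delta_g$ instead, the factor $4t$ becomes $2t$.

$$
d_g(x, y)^2 \;=\; -\lim_{t \to 0^+} 4t \, \log K_t(x, y).
$$

Under the heat-kernel identification of step (3), with attention weight $\alpha(x, y) \propto K_t(x, y)$ at small $t$, this suggests reading $-4t \log \alpha(x, y)$ as an estimate of squared geodesic distance, up to normalisation terms. That reading is only as strong as the identification assumption itself.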
C. Jacobi propagation via JVP
The Jacobi-action claim is naturally theorem-shaped: for smooth geodesic flow, applying $J(\lambda)$ to a perturbation is exactly a Jacobian–vector product, so Pearlmutter-style JVP gives linear-cost directional propagation without materializing full Jacobians. With effective curvature rank $r \ll \dim M$, the practical cost can be restricted to an $r$-dimensional subspace. This is fully consistent with the repository's batched-Jacobi implementation direction.
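The JVP idea can be illustrated on a toy flow. The sketch below is an assumption-laden stand-in: it integrates $\dot{x} = \tanh(Wx)$ with explicit Euler steps instead of a true geodesic flow, and pushes a tangent vector forward with the directional derivative at each step, so the Jacobian of the flow map is never materialized. The function name and the vector field are hypothetical.

```python
import numpy as np

def flow_with_jvp(W, x0, delta0, steps=100, h=0.01):
    """Euler-integrate dx/dt = tanh(W x) and transport a tangent vector.

    The tangent update uses the directional derivative
        Df(x)[delta] = (1 - tanh(W x)^2) * (W delta),
    so each step costs two matrix-vector products and the Jacobian of
    the time-T flow map is never formed.
    """
    x, delta = x0.copy(), delta0.copy()
    for _ in range(steps):
        a = np.tanh(W @ x)
        jvp = (1.0 - a**2) * (W @ delta)  # directional derivative of the field
        x = x + h * a
        delta = delta + h * jvp
    return x, delta

rng = np.random.default_rng(0)
k = 8
W = rng.normal(size=(k, k)) / np.sqrt(k)
x0 = rng.normal(size=k)
direction = rng.normal(size=k)
endpoint, transported = flow_with_jvp(W, x0, direction)

# Cross-check against a central finite difference of the same discrete flow.
eps = 1e-6
plus, _ = flow_with_jvp(W, x0 + eps * direction, direction)
minus, _ = flow_with_jvp(W, x0 - eps * direction, direction)
finite_difference = (plus - minus) / (2.0 * eps)
```

In an autodiff framework the same quantity would come from a forward-mode JVP call; the hand-written derivative here only keeps the example dependency-free.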
D. HJB-regularised AttnRes/GTC training objective
For future joint training (not yet in this repo), an SHF-style regularizer can be written as a finite-difference Jacobi penalty over block summaries,
with the interpretation that minimising this term encourages trajectories that are closer to discrete HJB/Jacobi-consistent paths in summary space.
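As a rough illustration of what a finite-difference penalty over block summaries can look like, the sketch below uses a flat-chart surrogate: the squared second difference of consecutive summaries, which vanishes exactly when the summaries are equally spaced along a straight line. This is an assumption for illustration only; it omits any curvature term and is not claimed to be the regularizer intended in this section. The name `second_difference_penalty` is hypothetical.

```python
import numpy as np

def second_difference_penalty(summaries: np.ndarray) -> float:
    """Sum of squared second differences over consecutive block summaries.

    summaries: (n_blocks, d). In a flat chart a discrete geodesic has zero
    second difference, so the penalty measures deviation from that case.
    """
    accel = summaries[2:] - 2.0 * summaries[1:-1] + summaries[:-2]
    return float(np.sum(accel**2))

# Equally spaced points on a line versus a noisy version of the same path.
steps = np.linspace(0.0, 1.0, 8)[:, None]
straight = steps * np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(0)
noisy = straight + 0.05 * rng.normal(size=straight.shape)
penalty_straight = second_difference_penalty(straight)
penalty_noisy = second_difference_penalty(noisy)
```

A curvature-aware variant would add a term coupling the middle summary to a curvature estimate, which requires calibration constants that are not specified here.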
E. Ricci-spectral safety bound and operational $\rho$ estimator
A practical path for cheap injectivity-radius estimation is to treat curvature proxies from attention spectra/gradient covariance as bounded surrogates for local sectional curvature and combine them with a Klingenberg-style lower-bound form, giving an estimator family
In this paper this should be treated as an operational estimator template with model-family calibration constants, not as a universal exact equality.
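One hedged reading of a Klingenberg-style form is $\hat{\rho} = c \, \pi / \sqrt{\hat{\kappa}}$, where $\hat{\kappa}$ is an upper-bound proxy for sectional curvature and $c$ is a calibration constant. Both symbols and the function below are assumptions introduced for illustration; the sketch models only the curvature term of the bound and ignores the closed-geodesic term.

```python
import math

def rho_estimate(kappa_max: float, calibration: float = 1.0) -> float:
    """Klingenberg-style radius estimate: calibration * pi / sqrt(kappa_max).

    kappa_max:   upper-bound proxy for local sectional curvature
    calibration: model-family constant (hypothetical, to be fitted)

    For non-positive curvature the curvature term imposes no bound, so
    infinity is returned; short closed geodesics are not modelled here.
    """
    if kappa_max <= 0.0:
        return math.inf
    return calibration * math.pi / math.sqrt(kappa_max)

# Higher curvature proxy gives a smaller validity radius.
radii = [rho_estimate(kappa) for kappa in (0.25, 1.0, 4.0)]
```

In practice the returned value would be compared against a measured validity-radius sweep before being trusted for a given model family.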
What remains open
The original draft's five blockers are no longer all in the same category. In the current repository, three are deployment-scoped engineering closures, one is a deployment-scoped mathematical closure, and one remains a genuine research problem if the goal is end-to-end AttnRes+GTC training rather than inference-time correction.
- The diffeomorphism $\phi$. As a universal transformer-scale construction, this remains open. For the concrete OTT deployment manifolds in this repository, however, the requirement is treated as resolved via certificate-backed inherited-structure arguments on star-shaped manifolds. This is a deployment-scoped closure, not a universal one, and it does not come from a Hodge-theoretic derivation.
- Geodesic initial velocity $v_0$. A universal closed-form derivation remains open, but the runtime no longer lacks a deployable substitute: the OTT path already uses a curvature-guided initial-velocity prior that starts from the endpoint direction and applies a Christoffel-based local acceleration correction. So the mathematical derivation is still open, while the deployment blocker has been reduced to validation and calibration of an implemented surrogate.
- Jacobi propagator construction cost. This is no longer a live blocker in the repository. The cost is paid at library-construction time and is amortized by the compressed record store, exact low-rank $\Phi$ truncation on the measured small-cloud regime, and batched Jacobi resonance results that already exceed the paper's analytic speedup estimates. The remaining issue is offline build throughput, not missing theory or missing runtime machinery.
- AttnRes + GTC joint training. If GTC is to correct AttnRes block summaries via Jacobi fields, the training objective must encourage Jacobi-smooth trajectories in block-summary space. This is still open as a training problem. What is already complete is the weaker inference-time claim: the repo has a measured AttnRes correction prototype, and its current result is that single-anchor Jacobi transport is promising while simplex blending underperforms.
- Injectivity radius estimation. Exact per-record estimation from scratch remains expensive in the abstract, but for the current GTC pipeline the requirement is already handled operationally: the record store carries a per-record $\rho$ estimate, and the measured validity-radius sweep on the fitted LM manifold shows the Jacobi regime stays below 0.1% error out to the tested threshold. So this item is deployment-scoped closed even though the cheapest possible universal estimator is still open.
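The curvature-guided $v_0$ prior described in the list above can be sketched with a second-order expansion of the geodesic equation. Writing $\ddot{x}^i = -\Gamma^i_{jl} \dot{x}^j \dot{x}^l$ and expanding to second order gives $x(1) \approx x_0 + v_0 - \tfrac{1}{2}\Gamma(v_0, v_0)$, hence $v_0 \approx \Delta + \tfrac{1}{2}\Gamma(\Delta, \Delta)$ with $\Delta = x_1 - x_0$. This is a minimal sketch of that idea, not the runtime's implementation; the function name and array layout are assumptions.

```python
import numpy as np

def initial_velocity_prior(x0, x1, christoffel):
    """Second-order guess for v0 such that exp_{x0}(v0) is close to x1.

    x0, x1:      (k,) start and target points in chart coordinates
    christoffel: (k, k, k) array with christoffel[i, j, l] = Gamma^i_{jl} at x0
    """
    delta = x1 - x0  # endpoint direction
    # Gamma(delta, delta)^i = Gamma^i_{jl} delta^j delta^l
    accel = np.einsum("ijl,j,l->i", christoffel, delta, delta)
    return delta + 0.5 * accel

# Check on the flat plane in polar coordinates (r, theta), where geodesics
# are straight lines and the exact answer is known.
gamma = np.zeros((2, 2, 2))
gamma[0, 1, 1] = -1.0                   # Gamma^r_{theta theta} = -r at r = 1
gamma[1, 0, 1] = gamma[1, 1, 0] = 1.0   # Gamma^theta_{r theta} = 1/r at r = 1
start = np.array([1.0, 0.0])
target = np.array([np.sqrt(1.01), np.arctan(0.1)])  # Cartesian point (1, 0.1)
exact_v0 = np.array([0.0, 0.1])         # straight line with Cartesian velocity (0, 0.1)
prior_v0 = initial_velocity_prior(start, target, gamma)
error_with_correction = np.linalg.norm(prior_v0 - exact_v0)
error_without_correction = np.linalg.norm((target - start) - exact_v0)
```

The polar-coordinate check is useful because the manifold is flat, so any improvement over the plain endpoint direction comes purely from the Christoffel term.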
Where this paper sits
Framing: still primarily a theory paper, but now partially anchored by real LM manifold measurements. The remaining distance from full OTT deployment is much smaller than the original draft implied: the repo now has deployment-scoped diffeomorphism closure for its current manifold family, real-manifold GTC measurements, a deployable $v_0$ surrogate, operational $\rho$ estimates, and an AttnRes correction prototype. The main unsolved pieces are a runtime-integrated decode path at useful cloud density, denser live activation telemetry, a universal $\phi$ construction, and any future attempt to make AttnRes+GTC a jointly trained objective rather than an inference-time correction scheme. Treat the large speedup figures as conditional on those remaining steps, not as current benchmark claims.
What this paper is not
Paper 4 is the theory layer of the OTT/GTC programme. The boundaries below are deliberate, and they should be read together with the measurement-side limitations in Paper 5 (which is the empirical companion to this paper) and Paper 3 (which carries the speculative-decoding anchor).
- Universal vs. deployment-scoped. The diffeomorphism $\phi:\mathcal{M}_\theta\to\mathbb{R}^k$ is closed for the specific OTT deployment manifolds in this repository via inherited-structure certificates on star-shaped manifolds; it is not closed as a universal transformer-scale construction. Treat $\phi$-existence as a per-build property of the fitted manifold, not as a theorem.
- Speedup figures are conditional. Numerical speedups in §§1--3 (e.g. $\mathcal{O}(nk^2)$ vs $\mathcal{O}(n^2 dL)$, the $4800\times$ figure at $n{=}32{,}768$) are derived from the geodesic cost model assuming GTC fully replaces standard attention in the decode path. The repository does not yet ship that runtime; the measured OTT runtime anchor (Paper 5 §6, Paper 3 §5.5) is a compression-draft + verifier pipeline at $\sim76.5$ tok/s, not a pure-geodesic decode loop.
- Manifold dimension claim is empirical. The intrinsic dimension $k\approx30$--$50$ is supported by the Phase-1 fits on SmolLM2-135M, Phi-3.5-mini, and Gemma-4-E2B in Paper 5 §3. It is not a theorem about transformers in general; cross-architecture generalisation past these three fits is open.
- Fisher-information metric is an approximation. The Riemannian metric is approximated from the Phase-1 covariance estimate, not from full Fisher information. This is consistent with how the runtime treats the manifold operationally, but a derivation that links the covariance approximation to the true Fisher metric in the transformer setting is not given here.
- AttnRes + GTC joint training is unsolved. The connection in §3 is read in one direction only: AttnRes weights are interpretable as geodesic-segment selection. The reverse, training AttnRes to be Jacobi-smooth against a GTC library so that the correction objective and the training objective agree, is open as a training problem.
- Injectivity radius and $v_0$ derivations. The runtime has a deployable surrogate for both (per-record $\rho$ estimate, curvature-guided $v_0$ prior with Christoffel correction), and the Jacobi validity radius is measured in Paper 5; closed-form universal derivations are not given.
In short: this paper sets the geometry, Paper 5 measures it on three fitted manifolds, and Paper 3 anchors the end-to-end decode path. Where the three disagree, the measurement wins.
Where to find definitions
For brevity, Paper 4 reuses the glossary tables in Paper 1 §0.5 (rank $r$/$k$, residual stream, decode vs prefill), Paper 2 §0.5 (PCA basis, projection slot, geometry cache, depth-sink), and Paper 3 §0.5 (acceptance rate $\alpha$, draft/verifier, OneDecode). Terms specific to this paper (manifold $\mathcal{M}_\theta$, intrinsic dimension $k$, Fisher-information metric, geodesic, Jacobi field, injectivity radius $\rho$, exponential map $\exp_p$, parallel transport, diffeomorphism $\phi$) are introduced inline at first use.
Selected refs
- Stewart, W. K. O., Geodesic Trajectory Caching and the OTT Runtime Anchor, this site, Paper 5 v0.1, 2026.
- Stewart, W. K. O., Composing Compression: Geodesic Speculative Decoding and Attention Residuals, this site, Paper 3 v0.3, 2026.
- Stewart, W. K. O., Geodesic Projection: A Production Compression Pipeline for LLM Inference, this site, Paper 2 v0.2, 2026.
- Stewart, W. K. O., Attention Compression at Constant Quality: A Geometry-Only PCA Basis for Q/K/V, this site, Paper 1 v0.4, 2026.
- Kimi Team, Block Attention Residuals, arXiv:2603.15031, 2026.
- Amari, S., Information Geometry and Its Applications, Springer, 2016. (Fisher-information metric, natural gradient.)
- do Carmo, M. P., Riemannian Geometry, Birkhäuser, 1992. (Geodesic equation, Jacobi fields, injectivity radius, exponential map.)
- Lee, J. M., Introduction to Smooth Manifolds, 2nd ed., Springer, 2013. (Diffeomorphisms, smooth structure on $\mathcal{M}_\theta$.)
- Magnus, W., On the exponential solution of differential equations for a linear operator, Comm. Pure Appl. Math., 1954. (Series used for parallel-transport approximations.)
- Tenenbaum, J. B., de Silva, V., and Langford, J. C., A global geometric framework for nonlinear dimensionality reduction, Science, 2000. (Manifold-hypothesis precedent for low-dimensional structure in high-dimensional representations.)