Paper B - GP

Geodesic Projection: per-layer rank, MCR allocation, and the depth-sink shortcut

HyperTensor Project (William Ken Ohara Stewart)

HyperTensor Project · April 2026 · PDF · TeX source

The Geodesic Projection Pipeline

Introduction and Scope

The companion paper isolates a single design choice, calibration-free attention-only fixed-rank compression, and reports an end-to-end measurement on Llama-3.1-8B-Instruct Q4_K_M (Stewart 2026). Geodesic Projection is the full compression pipeline that the geodessical runtime implements. It extends the companion-paper construction along three axes, discussed below in turn.

Project nomenclature

We use a four-level naming hierarchy across this paper series. HyperTensor is the project (compiler, runtime, kernels, theory). Geodesic Projection, the subject of this paper, is the inference-time pipeline that factorises every compressible weight along its dominant geometric directions. Geodesic Runtime Compression (GRC) is the attention-only subset of GP and is the construction measured by the companion paper. OTT and GTC are the Riemannian-manifold theoretical framework that motivates GP and are treated separately. The set inclusion is GRC \(\subset\) GP \(\subset\) HyperTensor, with OTT and GTC as the theoretical superstructure.

Three axes of extension

The companion paper compresses three slots (\(Q,K,V\)) at a single uniform rank with no calibration data. GP generalises along three axes, summarised below.

Slot coverage



GP compresses all seven slots, \(Q\), \(K\), \(V\), \(O\), FFN up, FFN gate, and FFN down on a dedicated SVD path.

Per-layer rank



A manifold-curvature-ratio (MCR) heuristic budgets rank by layer with a hard floor.

Persistent caching



A two-tier on-disk artifact stores the basis and the projected weights and is validated by a single depth-sink layer spectrum hash.

Scope of this revision

This revision does not provide multi-model end-to-end PPL or throughput sweeps. The static-GP measurement is single-model, single-hardware, and is reproduced from the companion paper. Llama-3.1-70B numbers are not included; compute is approved and the run is queued. We use the term design-validated for code paths that exist, run without error, and are consistent with their components, but for which no end-to-end benchmark has produced numbers in this revision.

Status of the four adaptive mechanisms

This revision ships four adaptive mechanisms. Of those, exactly one (MCR, 4) is the measurement target of this paper. Phase-aware per-layer rank allocation has a complete implementation, a documented calibration recipe, and an executed A/B against the shared-rank baseline of the companion paper. The result of that A/B is a measured null at \(k_\text{target}{=}1024\) on Llama-3.1-8B. In a clean repeat with no co-resident workloads, MCR matches the shared-rank baseline within run-to-run noise (\(-0.13\%\), \(1{\times}\) variance). An earlier contaminated pass spuriously suggested \(-2.2\%\) with \(21{\times}\) variance; the variance ratio was the contamination signal and the measurement was retaken. The null finding at the cache-fit knee is itself the contribution. Phase-aware rebalancing neither helps nor hurts above the knee, and the open question becomes a sweep at lower base ranks where the knee is closer.

The remaining three mechanisms, the GL\((d)\) Axiom Gauge (2), thermal-aware rank (3) and the rejection-driven Oja online basis (4), are presented as v0.2 design specifications. The geometry is derived, the algorithm is given, and the runtime hooks exist, but the paired ablations are deferred to a companion v0.2 release.

Cross-Architecture Manifold Evidence

End-to-end Geodesic Projection (GP) pipeline. Per layer, the seven compressible weights (left, red) feed a manifold fit driven by both their Frobenius spectra and a small set of activation samples. The fit produces (i) an MCR phase tag (Mix / Compress / Refine, §1) and (ii) a per-layer rank \(k_\ell\) which together populate a persistent geometry cache. At inference time, the runtime reads the compressed slots and the cache; rejected speculative tokens or drift signals feed back (dashed) into the online Oja update of the basis (§4). GRC (the construction measured in the companion paper) is the same pipeline restricted to the three attention slots \(\{\Wmat_Q,\Wmat_K,\Wmat_V\}\) with \(k_\ell\) uniform across layers.

The premise of GP is that the local activation manifold of a trained transformer has intrinsic dimension much smaller than the ambient model dim \(d\). If that premise is false, low-rank weight projection should be uniformly destructive. We ran the manifold-extraction pipeline of legacy/axiom_vis/ (sampling \(\to\) symmetry detection \(\to\) Ricci curvature \(\to\) axiom-set extraction) on three open-weight models.

Intrinsic-dimensionality estimates (PCA, 95% variance retained) across three open-weight architectures. Source data: legacy/axiom_vis/<model>/phase{1,4}_*.json.
Model \(d\) \(k_\text{int}\) \(k_\text{int}/d\) Axiom set Consistency
SmolLM2-135M 576 17 2.95% 24 / 96 0.921
Gemma-4-E2B 1,536 25 1.63% 23 / 92 0.961
Phi-3.5-mini 3,072 11 0.36% 22 / 96 0.959

The three values \(\{11,17,25\}\) span an order of magnitude in \(d\) but do not grow with \(d\). This is consistent with the isotropy/anisotropy literature on contextual embeddings and with the linear-representation hypothesis (Elhage et al. 2022). Activation-space intrinsic dim is necessary but not sufficient for the GP construction; the actual weight-space evidence is per-slot SVD spectra (3).

Fourth measurement: Llama-3.1-8B at \(d{=}4096\).

A 2026-04-28 run of the same Phase-1 pipeline on Llama-3.1-8B-Instruct (reported in axiom_beta_report.json) keeps \(102\) PCA components at the same 95% variance threshold, so \(k_\text{int}/d = 102/4096 = 2.49\%\). This is the same order-of-magnitude regime as the smaller models and is not a per-architecture artefact. The empirical GRC operating point \(k{=}1024\) in the companion paper sits ten times above this intrinsic dimension, which is consistent with the \(13.30\%\) PPL penalty at \(k{=}1536\) being cache-fit dominated rather than information-loss dominated.

The Seven Slots and Their Spectra

The seven compressible matrices in a Llama-style block, with shapes for Llama-3.1-8B (\(d{=}4096\), \(d_\text{ffn}{=}14336\), GQA (Ainslie et al. 2023)): \(Q,O\!:\![4096,4096]\); \(K,V\!:\![1024,4096]\) (8 KV heads \(\times\) 128); \(W_\text{up},W_\text{gate}\!:\![14336,4096]\); \(W_\text{down}\!:\![4096,14336]\).

Singular-spectra summary for all seven slots, layer-mean over \(\{0,7,15,23,31\}\) on Llama-3.1-8B Q4_K_M (dequantised).
Slot Shape Rank for 95% Rank for 99% Energy@\(k{=}1024\)
\(Q\) [4096, 4096] 1,682 2,434 0.836
\(K\) [1024, 4096] 605 802 1.000 (GQA cap)
\(V\) [1024, 4096] 809 954 1.000 (GQA cap)
\(O\) [4096, 4096] 2,118 2,868 0.743
\(W_\text{gate}\) [14336, 4096] 3,271 3,840 0.539
\(W_\text{up}\) [14336, 4096] 3,360 3,873 0.492
\(W_\text{down}\) [4096, 14336] 3,345 3,863 0.490

Asymmetry between attention and FFN.

Attention slots are highly compressible (\(k_{95}\approx 0.4\) to \(0.5\,d\)) and retain about \(80\%\) of energy at \(k{=}1024\). FFN slots require \(k_{95}\approx 0.8\,d\) and lose about half of Frobenius energy at \(k{=}1024\). This is the quantitative version of the qualitative "FFN is flat" observation in Geva et al. (Geva et al. 2021). We give \(W_\text{down}\) a dedicated SVD path rather than the Gram-eigendecomposition route used for the other six slots, because the construction is mathematically equivalent in the rank-truncated limit but numerically more stable when the spectrum is approximately flat (Golub and Van Loan 2013; Eckart and Young 1936).

Geometric reading of the asymmetry.

Attention and FFN solve different problems. The attention block is a routing circuit: \(\Wmat_Q,\Wmat_K\) project tokens into a similarity space whose discriminative directions are concentrated in a few hundred coordinates (head subspaces plus the residual stream's preferred basis), and \(\Wmat_V,\Wmat_O\) assemble a context-conditional message in that low-rank space. Routing is structurally low-rank because the number of things being routed is small relative to \(d\). The FFN, by contrast, behaves as a key-value memory bank (Geva et al. 2021), with each FFN column responding to a localised input pattern, so a high-quality FFN must span a high-dimensional set of keys. Removing FFN directions removes individual memories, while removing attention directions removes routing fidelity that is partially recoverable from the residual stream. The predicted consequences are the roughly \(3.4\times\) rank gap between FFN and attention \(k_{95}/d\) and the runtime observation that aggressive FFN low-rank truncation is unstable at 8B scale while attention truncation remains benign down to \(k{=}1024\). GP therefore treats \(W_\text{up}\) and \(W_\text{gate}\) with a high-floor strategy and isolates \(W_\text{down}\) on its own SVD path.

Per-Layer Rank Selection (MCR)

Constrained-budget formulation.

Let layer \(\ell\in\{1,\dots,L\}\) have spectrum \(\{\sigma_{\ell,1}\geq\sigma_{\ell,2}\geq\cdots\}\) and define the captured Frobenius energy at rank \(k_\ell\) as \[\begin{equation} \mathcal{E}_\ell(k_\ell)\;=\;\frac{\sum_{i\le k_\ell}\sigma_{\ell,i}^2}{\sum_i\sigma_{\ell,i}^2}\;\in[0,1]. \label{eq:Eell} \end{equation}\] GP allocates rank by approximately solving the budget-constrained maximisation \[\begin{equation} \max_{\{k_\ell\}}\;\sum_{\ell=1}^{L}\mathcal{E}_\ell(k_\ell) \quad\text{s.t.}\quad \sum_{\ell=1}^{L}k_\ell\le K_\text{budget},\;\; k_\text{floor}\le k_\ell\le K_\text{max}. \label{eq:mcr} \end{equation}\] Because each \(\mathcal{E}_\ell\) is concave and non-decreasing, the Lagrangian KKT conditions reduce to a water-filling rule: spend the next rank slot on whichever layer has the largest marginal energy gain \(\sigma_{\ell,k_\ell+1}^2 / \sum_i\sigma_{\ell,i}^2\). The MCR heuristic below is a closed-form proxy for this rule that uses the manifold curvature ratio in place of an expensive per-step spectrum lookup.

A naive scheme assigns the same \(k\) to every layer. GP exposes a manifold-curvature-ratio heuristic that allocates more rank to layers whose local activation geometry has higher (Ricci-style) curvature: \(k_\ell = \mathrm{clip}\bigl(\,\mathrm{round}(k_\text{target}\cdot \mathrm{MCR}_\ell/\overline{\mathrm{MCR}}),\;k_\text{floor},\;K_\text{max}\bigr)\) with \(K_\text{max}{=}1536\) (AXEX_MANIFOLD_K_MAX) and \(k_\text{floor}\approx 0.55\,d\). The floor matters more than the cap; below \(\approx 0.4\,d\) on small models we observe text-quality collapse.

MCR-driven allocation is the part of GP we are least confident about. It is enabled by default, but the companion-paper measurements all use shared rank for a clean calibration-free claim. The remainder of this section reports the matched-budget A/B and a confound caused by the weight-PCA basis path.

MCR vs. shared-rank A/B (Llama-3.1-8B, 2026-04-29).

We ran the matched-budget A/B at \(k_\text{target}{=}1024\) on Llama-3.1-8B-Instruct Q4_K_M (RTX 4070 Laptop, \(-n\,64\), deterministic, \(n{=}4\) reps + cold-cache warmup) twice. An initial pass conducted with unrelated background workloads on the same machine reported \(37.93\pm 0.05\) tok/s baseline vs. \(37.10\pm 1.05\) tok/s MCR (\(-2.2\%\), \(21\times\) higher MCR variance). A clean repeat with no co-resident processes (verified by nvidia-smi idle pre-launch and process audit) yields baseline \(39.28\pm 0.05\) tok/s vs. MCR \(39.23\pm 0.05\) tok/s, a \(-0.13\%\) delta within run-to-run noise, \(1{\times}\) variance. The variance gap in the first run was the contamination signal; both means rose by \(\sim 1.4\) tok/s once the machine was quiet, confirming \(\sim 3.5\%\) throughput was being eaten by the unrelated workload. We therefore report MCR at this rank as statistically indistinguishable from the shared-rank baseline, not worse. The interpretive question remains: at \(k{=}1024\) the shared-rank already sits above the cache-fit knee, so per-layer rebalancing has little room to act, the right test is a sweep across lower base ranks \(k{\in}\{256, 512, 768\}\) where the knee is closer. Raw CSVs at docs/figures/paper-b/mcr_ablation.csv (clean) and mcr_ablation_run1_contaminated.csv (preserved for audit).

Weight-PCA flat-fallback finding (2026-04-29).

A second measurement reveals an interaction between MCR and the activation-free --axex-compress basis (weight-PCA mode, default since v0.6.0). MCR computes the layer-wise curvature ratio from the activation phase tags produced during the calibration forward pass; in weight-PCA mode no calibration pass is performed, so the activation variance vector is identically zero. The runtime correctly detects this ([MCR] Flat activation-variance profile, uniform rank recommended, phases_valid=0/32) and falls back to a uniform \(k_\ell{=}K_\text{max}\) allocation, which is identical to the shared-rank baseline. The implication is that MCR's per-layer rebalancing is only active when paired with --axex-actaware-pca; pure weight-PCA users get a uniform allocation regardless of whether --axex-mcr is set. We disclose this explicitly because it confounds any naive comparison between MCR and baseline runs that share the weight-PCA basis: in that pairing the A/B is a no-op by construction, not a null result. Run log: docs/figures/paper-b/v02_empirical/mcr_8b_ppl.log.

Sink-Basis Bypass

StreamingLLM (Xiao et al. 2023) preserves a handful of sink tokens in the KV cache because the first few positions hoard disproportionate attention mass and L2 norm. To our knowledge, no prior work preserves the corresponding sink direction in the weight/activation PCA basis. GP does.

Let \(\hat s\in\mathbb{R}^d\) be the unit vector of the dominant residual-stream direction at the sink position (we estimate it as the top eigenvector of \(\mathbb{E}[h_0 h_0^\top]\) over a calibration-free 32-token zero-prompt forward). Given the per-layer projector \(V_{r}^{(\ell)}\) produced by MCR, define the sink coverage \[\begin{equation} \eta_\ell \;=\; \max_{1\le k\le r}\;\bigl|\langle\hat s,\,V_{r}^{(\ell)}[:,k]\rangle\bigr|. \end{equation}\] Whenever \(\eta_\ell < \varepsilon_{\mathrm{sink}}\) (default \(\varepsilon_{\mathrm{sink}}=0.15\)) the runtime appends \(\hat s\) as an extra column and re-orthonormalises: \[\begin{equation} V_{r+1}^{(\ell)} \;\leftarrow\; \mathrm{QR}\!\bigl([\,V_{r}^{(\ell)}\;\;\hat s\,]\bigr). \end{equation}\] This adds at most one rank per layer and bounds the sink-step reconstruction error by \(\|h_0\|\sin\theta(\hat s, V_{r+1}^{(\ell)})\), which is zero by construction. Empirically the bypass fires on \(3{-}7\%\) of layers and eliminates the rare "first-token spike" in perplexity that shared-rank GP otherwise exhibits at very low \(k\).

Geometry Cache and Depth-Sink Shortcut

Building GP bases from scratch on Llama-3.1-8B takes \(\sim\)70 s on the reference GPU; on the 70B target it is extrapolated at \(\sim\)70 min. The runtime persists two artifacts:

  1. ott_geometry.bin: per-layer spectra, axiom set, depth-sink index, intrinsic-dim estimate, weight-blob hash; a few MB.

  2. \(W_\text{proj}\) cache: projected weights, \(\sim 1{,}093\) MB on Llama-3.1-8B at \(k{=}1536\).

Both caches are keyed on \((\textsf{model digest},\textsf{flag set},k)\).

The Two-Thirds Dimensionality Saturation Rule

Reloading the full geometry cache on a 70B model is not free, but we only need to verify that the cache is not stale. The runtime defines the depth-sink as the single layer where the cumulative explained-variance curve of the residual stream first flattens to within \(10^{-3}\) of its plateau. Across the four open transformers we have inspected, this layer sits near depth \(\ell^\star\approx 2L/3\):

model layers \(L\) depth-sink \(\ell^\star\)
Llama-3.1-8B 32 21
Gemma-2-2B 26 19
Phi-3.5-mini 32 22
Qwen-2.5-7B 32 22

We call this empirical observation the Two-Thirds Dimensionality Saturation Rule. The name is ours and the observation is, to our knowledge, first reported here. The rule is exploited as an \(O(1)\) proxy for the otherwise \(O(L)\) cache-integrity check: the integrity probe reads only the depth-sink layer's spectrum and recomputes it on the fly (under \(200\) ms on Llama-3.1-8B against about \(70\) s for the full sweep).

The data motivating the rule are these four spectra. In each model the cumulative explained variance of the residual stream rises monotonically from layer \(0\), accelerates through the early Mix phase, and settles into a plateau where successive layers add less than \(10^{-3}\) of total energy. The first layer at the plateau is the depth-sink. Across the four architectures the ratio \(\ell^\star/L\) is \(\{21/32,\,19/26,\,22/32,\,22/32\}=\{0.656,\,0.731,\,0.688,\,0.688\}\), which clusters near \(2/3\) with a spread of about \(0.04\) either side. We interpret the regularity as a property of the residual-stream geometry: the early Mix layers populate the basis, the middle Compress layers add nearly orthogonal directions, and once the basis saturates the later Refine layers reuse existing directions rather than generating new ones. The depth-sink is therefore the first layer at which the residual stream stops growing dimensionally, which is also the earliest layer whose spectrum is sensitive to corruption of any upstream basis. That makes its hash a sufficient probe for cache staleness.

The rule is a property of these four models, not a theorem. We expect it to hold approximately on architectures whose residual stream is rank-saturating, and to break on mixture-of-experts and on extremely shallow stacks where saturation never completes. The 70B Llama-3.1 run queued for v0.2 will add a fifth point. If the ratio holds within \(\pm 0.05\) across that addition we will keep the rule as a default setting and expose the depth-sink layer index as a tunable for out-of-distribution architectures. If it fails, the integrity probe falls back to a full-sweep check, costing the seventy seconds.

Static-GP measurement on Llama-3.1-8B

GP at attention-only-shared-rank reproduces the companion-paper configuration, so its end-to-end numbers are reproduced here under the GP framing for self-containment. Hardware is an RTX 4070 Laptop with \(32\) MB of L2.

End-to-end GP numbers on Llama-3.1-8B-Instruct Q4_K_M.
Configuration Decode (%base) PPL \(W_\text{proj}\) disk
Baseline 100.00 6.7902 n/a
GP attn-only \(k{=}1024\) 106.27 10.96 (+61.4%) 729 MB
GP attn-only \(k{=}1536\) 97.55 7.69 (+13.30%) 1,093 MB
GP attn-only \(k{=}2048\) requested 101.04 7.69 (+13.30%) 1,093 MB

The \(k{=}2048\) row produces identical PPL to \(k{=}1536\) because AXEX_MANIFOLD_K_MAX silently caps the stored basis; it also matches the disk footprint.

Adaptive Extensions

The four mechanisms below address four distinct invariance failures of static rank-\(k\) GP, are mutually orthogonal in their state, and each is zero- or near-zero-overhead at inference time. The contribution is the combination; each individual mechanism has antecedents in the literature.

Four invariances assumed by static rank-\(k\) GP, their empirical violations, and the GP extension that addresses each. Invariance: an assumption that, if it held, would justify a single global rank chosen once at calibration time. Violation: a measured property of deployed transformers that breaks the assumption. Mechanism: the GP component that drops the assumption and tracks the measured property at inference time.
Invariance Violation Mechanism
All layers compress equally well at the same rank. Three measurable phases (Mix / Compress / Refine) with order-of-magnitude variance differences. 1
The SVD spectrum is intrinsic to \(W\). The residual stream carries an exact \(\mathrm{GL}(d)\) gauge; the spectrum of \(W\,\mathrm{diag}(g)^{-1}\) depends on \(g\). 2
Optimal rank is hardware-independent. Consumer GPUs throttle at \(\sim 85\,^\circ\mathrm{C}\) within \(\sim 30\) s; \(J/\text{tok}\) doubles in the throttled regime. 3
Calibration distribution matches inference distribution. Speculative-decode rejections directly measure drift. 4

MCR Phase-Aware Rank Allocation

For each layer \(\ell\), compute the per-feature variance of captured hidden states \(\sigma^2_\ell = \E[\norm{h_\ell-\bar h_\ell}^2]/d\), smooth with a 3-tap moving average \(\widetilde\sigma^2_\ell\), and let \(\sigma^2_{\min}=\min_\ell\widetilde\sigma^2_\ell\). The phase label is \[\mathrm{phase}(\ell) = \begin{cases} \mathrm{Compress} & \widetilde\sigma^2_\ell \leq c_\text{thr}\sigma^2_{\min}\\ \mathrm{Mix} & \ell < \ell_\text{compress\_start}\\ \mathrm{Refine} & \ell > \ell_\text{compress\_end} \end{cases}\] with default \(c_\text{thr}=1.5\). Given a global budget \(R=r_\text{base}\cdot L\), \[r_\ell = \mathrm{clip}\bigl(s_{\mathrm{phase}(\ell)}\cdot r_\text{base},\;r_\text{min},\;r_\text{max}\bigr),\] default scales \(s_\text{Mix}{=}1.5\), \(s_\text{Compress}{=}0.35\), \(s_\text{Refine}{=}1.2\), then renormalised so total budget matches \(R\).

Attention-sink bypass.

Sink tokens (BOS in particular) carry residual-stream norms several \(\sigma\) above the mean. StreamingLLM (Xiao et al. 2023) preserves sink tokens in the KV cache; we preserve the sink direction in the basis itself. When \(\max_k\inner{\hat s}{v_k}\) is small for the dominant sink direction \(\hat s\), we append \(\hat s\) to \(V_r\) before initialisation, costing one rank slot per affected layer.

Predicted gain.

At fixed total budget the predicted PPL improvement scales with \(\sigma^2_\text{Mix}/\sigma^2_\text{Compress}\), typically \(\sim 5\times\) in published variance profiles. Whether the prediction holds is the v0.2 measurement.

Axiom Gauge: \(\mathrm{GL}(d)\) Gauge-Optimal Compression

This section is a v0.2 design specification. The gauge-optimal basis is derived and the algorithm is given. The end-to-end PPL and throughput ablation against MCR-only is deferred to a companion v0.2 release. The transformer residual stream carries an exact gauge symmetry. For any invertible \(G\in\mathrm{GL}(d)\), the substitution \(x\!\mapsto\!Gx,\ W^\text{read}\!\mapsto\!W^\text{read}G^{-1},\ W^\text{write}\!\mapsto\!GW^\text{write}\) leaves outputs unchanged, but the SVD spectrum of \(W^\text{read}G^{-1}\) depends on \(G\) unless \(G\) is orthogonal. Static rank-\(r\) schemes do not exploit this freedom.

Construction.

Restrict to diagonal \(G=\diag(g)\), \(g\in\R^d_{>0}\), parameterise \(\lambda=\log g\) to keep \(g>0\), and minimise the joint tail energy \[\mathcal{L}(g)=\!\!\sum_{\ell,W\in\text{reads}_\ell}\!\!\mathrm{tail}_r\!\bigl(W\diag(g^{-1})\bigr) +\!\!\sum_{\ell,W\in\text{writes}_\ell}\!\!\mathrm{tail}_r\!\bigl(\diag(g)\,W\bigr),\] where \(\mathrm{tail}_r(M)=\norm{M}_F^2-\norm{\mathrm{trunc}_r(M)}_F^2\). The gradient in log-space is sparse and analytic: \[\partial_{\lambda_i}\mathrm{tail}_r(W\diag(e^{-\lambda}))=-2\norm{X_\text{tail}[:,i]}^2,\quad \partial_{\lambda_i}\mathrm{tail}_r(\diag(e^{\lambda})W)=+2\norm{X_\text{tail}[i,:]}^2.\] Both follow from \(\norm{M}_F^2=\trace(M^\top M)\). 10--30 outer iterations suffice; we normalise so \(\prod_i g_i^{1/d}=1\) to remove the global scale.

Bakeable, zero-overhead inference.

Once \(g^\star\) is found it is baked into the compressed factors before GPU upload: stored \(\widetilde V_t[k][i]=S_kV_{k,i}\,g_i\) (read), \(\widetilde U[i][k]=U_{i,k}/g_i\) (write). The inference path, the two-GEMV \(\mathrm{tmp}=V_r^\top x\), \(\mathrm{out}=U_r\mathrm{tmp}\), is unchanged. This is the cleanest possible deployment: a free pre-compute step at calibration time, no runtime branch, no extra memory traffic.

Predicted gain.

For models with LayerNorm-induced uniform feature scales the gain is modest (\(\lesssim 5\%\) tail energy); for models with strong per-feature scale asymmetry, \(\geq 20\%\) is plausible.

Thermal Rank and Tokens-per-Joule

This section gives the thermal control law and the tokens-per-joule objective. A 90-second sustained-decode telemetry trace on Llama-3.1-8B Q4_K_M is reported below. A closed-loop A/B against an unthrottled baseline (rank held fixed at \(r_\text{max}\) while the controller is permitted to actuate) remains the next step.

On the reference RTX 4070 Laptop, sustained Llama-3.1-8B inference reaches \(\sim 85\,^\circ\mathrm{C}\) within \(\sim 30\) s and the boost clock is clamped; tok/s falls to 50--60% of cold-start. Earlier work clamps the clock proactively; we clamp the compression rank instead.

Linear interpolation rule.

With temperature thresholds \(T_\text{low}{=}65\,^\circ\mathrm{C}\), \(T_\text{high}{=}85\,^\circ\mathrm{C}\) and rank bounds \(r_\text{min},r_\text{max}\), \[r(T)=\mathrm{clip}\!\left(r_\text{max}-(r_\text{max}-r_\text{min})\cdot \frac{T-T_\text{low}}{T_\text{high}-T_\text{low}},\;r_\text{min},r_\text{max}\right).\] NVML is loaded dynamically; if unavailable the rank reverts to base. Polling is rate-limited to once per 500 ms. An optional power cap \(P_\text{budget}\) further reduces rank when current draw exceeds the cap. The system is closed-loop stable as long as the heat-generation rate at \(r_\text{min}\) is below the cooling capacity at \(T_\text{high}\).

Tokens-per-joule diffplan term.

The differentiable rank plan of (Robbins and Monro 1951) style minimises reconstruction error with an \(\ell_1\) rank penalty. We add an explicit energy term: with \(J=P_\text{NVML}/(\text{tok/s})\) joules/token and a slowly-estimated coefficient \(\hat c_\text{rank}\) (joules per unit-rank per token), \[\Delta_\text{TPJ}\,\nabla_\ell^{(r)}= \lambda\,\hat c_\text{rank}\,p_\ell^{(r)}\bigl(R^{(r)}-r^\text{soft}_\ell\bigr), \quad r^\text{soft}_\ell=\sum_r p_\ell^{(r)}R^{(r)}.\] The plan-level objective becomes \(\mathcal L_\text{recon}+\alpha\norm{r}_1 +\lambda\,\hat c_\text{rank}\,r^\text{soft}\), a weighted blend of accuracy, headroom, and energy.

Predicted gain.

At \(r{=}1024\) baseline tok/s drops \(\sim 40\%\) once throttling engages. Thermal Rank avoids throttle altogether at the cost of a controlled rank reduction; if rank--PPL elasticity is \(<1\%\) per rank-step in the operating regime (typical for \(r\in[768,1280]\) on Llama-3.1-8B), the trade is favourable.

Telemetry trace, sustained decode (2026-04-29).

This trace is a one-sided telemetry capture of an unmodified frozen-rank run, not a closed-loop A/B against the Thermal Rank controller. It characterises the operating envelope the controller would see; the paired controller-on vs. controller-off ablation is listed as an open measurement in [sec:limitations]. We instrumented a 4-pass back-to-back decode of Llama-3.1-8B-Instruct Q4_K_M (4 \(\times\) 384 tokens, \(T{=}0.7\), RTX 4070 Laptop) with a 1 Hz nvidia-smi probe on the channels Thermal Rank consumes (temperature.gpu, clocks.sm, power.draw) for \(\sim\) 90 s wall-clock. Of 94 samples, 86 fell in the active band (util \(\geq\) 20 %): GPU temperature reached \(T_\text{max}{=}75\,^\circ\mathrm{C}\) with \(T_{p90}{=}74\,^\circ\mathrm{C}\) and mean \(61.2\,^\circ\mathrm{C}\); SM clock held at the boost ceiling (median 2235 MHz, max 2235 MHz); power drew a mean of 66.3 W with peak 116 W. Decode throughput was 25.8 \(\to\) 28.0 tok/s across the four passes (a +8.5 % warm-up rather than throttle), giving an estimated 0.415 tokens/joule at the active-window mean. The active-window p90 of \(74\,^\circ\mathrm{C}\) enters the Thermal Rank actuation band [\(T_\text{low}{=}65,\,T_\text{high}{=}85\)] \(^\circ\mathrm{C}\) but does not saturate it, so on this hardware/load pair the controller would operate at intermediate rank rather than \(r_\text{min}\) or \(r_\text{max}\) alone, the regime in which closed-loop measurements are most informative. Raw trace: docs/figures/paper-b/v02_empirical/thermal_sustained_8b.csv and summary at thermal_summary.json.

Online Basis: Rejection-Driven Oja

The rejection trigger and the decayed Oja update are implemented in the runtime (the onb_record_residual and onb_apply_pending hooks in the speculative-decode verifier path) and are gated by --axex-online-basis. A long-horizon drift ablation against a frozen-basis baseline is in progress and will appear in the next revision.

A static PCA basis fitted on calibration data drifts out of validity as inference moves into uncovered distributions. Speculative decoding (Leviathan et al. 2023; Chen et al. 2023) provides a free measurement of drift: every rejection is a position where the draft and the verifier disagreed, so the residual \(h^\text{tgt}-h^\text{drft}\) is a sample of the direction the current basis fails to capture.

Why Oja, not SGA or Krasulina.

Three online streaming-PCA estimators are candidates. Stochastic gradient ascent on the Rayleigh quotient (Shamir 2015) requires explicit orthogonalisation against all already-converged components, costing \(O(k^2 d)\) per update and a global synchronisation across rows. Krasulina's rule (Krasulina 1969) converges faster than Oja's on quadratic losses but needs an estimate of the running mean to subtract, which requires a second moving statistic per layer. Oja's rule needs neither: the orthogonalisation is implicit in the row-normalisation step, and the rule is correct under the centred residual that speculative rejections already provide (\(\E[h^\text{tgt}-h^\text{drft}]\!\to\!0\) on a well-trained draft model). The state per layer is exactly \(W_t\) plus a bump counter, and the per-update cost is \(O(kd)\).

Decayed Oja update.

Write the projection basis at step \(t\) as \(\Pmat_t=W_tW_t^\top\in\R^{d\times d}\), the orthogonal projector onto the row-span of \(W_t\). Oja's rule lifts this to a per-row update on \(W_t\) itself: for layer \(\ell\) with basis \(W_t\in\R^{k\times d}\) and sample \(x=h^\text{tgt}-h^\text{drft}\), \[\begin{equation} w_i^{(t+1)} \;\leftarrow\; w_i^{(t)} + \eta_t\,x\,(x^\top w_i^{(t)}), \qquad w_i^{(t+1)} \;\leftarrow\; w_i^{(t+1)}/\norm{w_i^{(t+1)}}, \quad i=1,\dots,k, \label{eq:oja} \end{equation}\] deflated row-wise so the matrix remains orthonormal. The induced map on the projector is \(\Pmat_{t+1}=\Pmat_t+\eta_t\bigl(xx^\top-\Pmat_t xx^\top\Pmat_t\bigr)+O(\eta_t^2)\), i.e. a rank-one tilt of \(\Pmat_t\) toward the residual direction \(x\) at rate \(\eta_t\). The learning-rate schedule \(\eta_t=\eta_0/\sqrt t\) satisfies the Robbins--Monro conditions \(\sum\eta_t=\infty\), \(\sum\eta_t^2<\infty\) (Robbins and Monro 1951); default \(\eta_0=0.01\). Convergence to the leading eigenvectors of the running covariance is classical (Oja 1982); what is new is the trigger, only rejection residuals are fed in, by construction the directions the basis fails to capture.

Convergence rate and bias floor.

For a stationary covariance \(\Sigma\) with eigengap \(\Delta=\lambda_k-\lambda_{k+1}\), Oja's rule with the \(\eta_t=\eta_0/\sqrt t\) schedule attains \(\E\norm{\Pmat_t-\Pmat^\star}_F^2 = O(d/(\Delta^2 t))\) after a burn-in (Allen-Zhu and Li 2017; Jain et al. 2016). For the depth-sink layer of Llama-3.1-8B at \(k{=}1024\) the eigengap on the activation covariance is \(\Delta\approx 6\times 10^{-3}\) (raw spectrum at docs/figures/paper-b/v02_empirical/depth_sink_spectrum.csv), so \(t\!\approx\!10^4\) samples reduce projector error to \(\sim 5\times 10^{-4}\), smaller than the calibration-PCA truncation error at the same rank (\(\sim 10^{-3}\)). Under the rejection-only trigger the effective sample rate is the speculative-decode rejection rate; on the reference run this is \(\approx 12\%\) of decoded tokens, so \(10^4\) updates corresponds to roughly \(80\,000\) tokens of decoded text, on the order of a single long document. Beyond the burn-in the bias floor is \(\eta_t\,\norm{\Sigma}\!\sim\!10^{-4}\) at \(\eta_0{=}0.01\), well below the truncation floor.

Interaction with rejection-sampling unbiasedness.

Speculative decoding is exactly correct under rejection sampling (Leviathan et al. 2023); rejected positions sample the same target distribution as accepted ones. The residual \(h^\text{tgt}-h^\text{drft}\) at a rejection is therefore an unbiased measurement of the direction along which the draft model's hidden state diverges from the verifier's, and the GP basis is fitted on the verifier's activations, so feeding only rejection residuals into Oja's rule is equivalent to running Oja's rule on samples of \(\Sigma_\text{drift}=\Sigma_\text{verifier}-\Sigma_\text{calibration}\). The result is a basis that tracks the calibration-to-deployment delta, not the absolute distribution; the calibration component is held by the static rows of \(W_t\) already.

Per-layer staleness.

A bump-counter onb_layer_state_t::version is incremented on every applied update. GP tracks the version it last consumed; if a layer's version moves ahead, the cached projection is recomputed lazily on the next forward. Updates apply between decode steps, queue bounded to 256, with a 4-rejection minimum gate.

Predicted behaviour and v0.2 measurement plan.

On matched calibration/test, online updates should be a no-op: rejection residuals concentrate near the existing basis and the \(xx^\top-\Pmat_t xx^\top\Pmat_t\) correction vanishes. On linearly drifting OOD prompts, the Oja-decayed estimator converges to the running covariance with bias \(\propto\eta_t\). The v0.2 measurement plan is a paired A/B on a synthetic-drift corpus: one stream of Wikipedia text (matched-distribution control) and one stream of post-2024 code mixed with non-English news (deliberate drift), each \(\sim 10^5\) decoded tokens, running frozen-basis vs. Oja-basis and reporting PPL trajectory plus the Frobenius distance \(\norm{\Pmat_t-\Pmat_0}_F\) as a function of \(t\). A visual of the drift-distance trajectory will accompany the v0.2 release; the empty trajectory plot would be premature here and is intentionally omitted.

Composition, Limitations, Reproduction

How the Four Mechanisms Compose

The four extensions touch disjoint parts of the pipeline: MCR changes the per-layer \(r_\ell\) before basis construction; Axiom Gauge changes the basis itself by left-/right-conjugating with \(\diag(g)\); Thermal Rank changes the truncation threshold at inference time; and Online Basis changes the basis between decode steps in response to drift. Their state is non-overlapping (rank vector, gauge vector, thermal scalar, version counter), and their update intervals are very different (one-shot, one-shot, 500 ms, per-rejection). All four are runtime-flag-gated and individually disabling.

Limitations

  • The static-GP measurement is single-model, single-GPU, single quantisation (Llama-3.1-8B-Instruct Q4_K_M, RTX 4070 Laptop, dequantised to fp32). No cross-hardware or cross-model end-to-end measurement is included for the adaptive layers.

  • All four adaptive extensions are design-validated: code paths exist, runtime flags work, but the v0.2 measurement sweep is the next milestone. Predicted gains in [sec:adaptive-mcr,sec:gauge,sec:thermal,sec:online] are upper-bound estimates from closed-form arithmetic, not measurements.

  • The \(W_\text{proj}\) cache is fp32; converting to Q8_0 (Gerganov and llama.cpp contributors 2023) would close the disk-footprint gap but re-quantises an already-quantised weight, and behaviour at the second quantisation boundary is uncharacterised.

  • Cache invalidation forces a full rebuild (\(\sim 70\) min for the 70B target), making model upgrades an event rather than a transparent rolling deploy.

  • FFN-only configurations are not yet quantified end-to-end. A preliminary single-prompt run at \(k{=}1024\) on Llama-3.1-8B-Instruct Q4_K_M (RTX 4070 Laptop) produced no measurable speed-up over baseline; this is consistent with the \(k_{95}\) spectrum analysis (FFN \(k_{95}/d{\approx}0.85\) vs. attention \(k_{95}/d{\approx}0.41\), see companion paper (Stewart 2026) Section "Spectral evidence for compressing attention only"), which predicts that the FFN spectrum is too flat for low-rank substitution to be a net win at this rank. We flag this as the spectrum predicting a null result rather than a characterised regression: a full FFN-only sweep at \(k\in\{512, 768, 1024, 1536\}\) with \(n_\text{prompts}{=}8\) and \(\geq 6\) repetitions per cell is a v0.2 milestone. Until then we do not claim a regression number, only that the design rationale of GP (attention-first) is consistent with the spectrum.

Reproduction

static-GP measurement: see the companion paper's reproduction recipe at https://github.com/NagusameCS/HyperTensor/blob/main/repro/REPRODUCE.md. Cross-model intrinsic-dim numbers are reproducible via:

# intrinsic-dim re-run for any model under legacy/axiom_vis/
geodessical <model.gguf> --axex-axiom-only --axex-export-vis <outdir>
# outputs <outdir>/phase1_manifold.json

The four adaptive extensions are reachable via: --axex-mcr, --axex-gauge, --axex-thermal-rank, --axex-online-basis, with sub-flags documented in runtime/nn/axiom_exploit.h.

Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. https://arxiv.org/abs/2305.13245.
Allen-Zhu, Zeyuan, and Yuanzhi Li. 2017. "First Efficient Convergence for Streaming k-PCA: A Global, Gap-Free, and Near-Optimal Rate." Proc. FOCS.
Chen, Charlie, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv Preprint. https://arxiv.org/abs/2302.01318.
Eckart, Carl, and Gale Young. 1936. "The Approximation of One Matrix by Another of Lower Rank." Psychometrika 1 (3): 211--18.
Elhage, Nelson, Tristan Hume, Catherine Olsson, Nicholas Schiefer, et al. 2022. Toy Models of Superposition. Anthropic Transformer Circuits Thread.
Gerganov, Georgi, and llama.cpp contributors. 2023. llama.cpp k-Quants (Q4_K_M, Q5_K_M, Q6_K) Format Specification. Https://github.com/ggerganov/llama.cpp/pull/1684.
Geva, Mor, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. "Transformer Feed-Forward Layers Are Key--Value Memories." EMNLP.
Golub, Gene H., and Charles F. Van Loan. 2013. Matrix Computations. 4th ed. Johns Hopkins University Press.
Jain, Prateek, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. 2016. "Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja's Algorithm." Proc. COLT.
Krasulina, T. P. 1969. "The Method of Stochastic Approximation for the Determination of the Least Eigenvalue of a Symmetrical Matrix." USSR Comp. Math. Math. Phys. 9 (6): 189--95.
Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2023. "Fast Inference from Transformers via Speculative Decoding." ICML.
Oja, Erkki. 1982. "Simplified Neuron Model as a Principal Component Analyzer." Journal of Mathematical Biology 15 (3): 267--73.
Robbins, Herbert, and Sutton Monro. 1951. "A Stochastic Approximation Method." The Annals of Mathematical Statistics 22 (3): 400--407.
Shamir, Ohad. 2015. "A Stochastic PCA and SVD Algorithm with an Exponential Convergence Rate." Proc. ICML.
Stewart, William Ken Ohara. 2026. "Geodesic Runtime Compression: A Calibration-Free, Super-Baseline Attention Compression." HyperTensor Companion Paper, to Appear.
Xiao, Guangxuan, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient Streaming Language Models with Attention Sinks. https://arxiv.org/abs/2309.17453.