Adaptive Compression: Phase, Gauge, Thermal, and Online Basis, HyperTensor

Scope , read first

This paper documents the design and runtime implementation of four adaptive-compression mechanisms layered on top of the static GP/GRC compression of Paper 1 and the geodesic-projection runtime of Paper 2:

MCR , Mix / Compress / Refine: per-layer rank allocation driven by an empirical phase profile of the residual stream plus an attention-sink bypass.
Axiom Gauge: gauge-optimal compression via the GL($d$) residual-stream symmetry, baked into the compressed weights at zero inference overhead.
Thermal Rank + Tokens-per-Joule: NVML-driven rank scaling against thermal headroom, plus an energy-aware policy gradient added to the differentiable rank plan.
Online Basis: Oja's-rule rank-1 PCA updates fired by speculative-decode rejection events, with per-layer staleness versioning.

What this paper does not contain: measured PPL or tok/s sweeps for any of the four mechanisms in isolation or in combination. Each mechanism is implemented and reachable from the host binary via a CLI flag; a sweep over (rank, scale, $\eta_0$, temp threshold) on the Paper 1 locked protocol is the obvious next milestone but is not in this revision. The publication threshold is the same as Paper 3 v0.2: real implementation, reviewable today, no fabricated empirical headline. Where a closed-form prediction can be stated, it is.

§0, Abstract

Abstract

Static rank-$r$ SVD compression of a transformer treats every layer, every weight matrix, and every operating point as exchangeable. None of those invariances actually hold: layers fall into one of three empirical phases (mix, compress, refine) with different sensitivity to rank reduction; the residual stream carries an exact GL($d$) gauge symmetry that no static SVD exploits; consumer hardware enters thermal throttling within $30$ seconds of sustained load and the optimal rank changes accordingly; and the inference distribution drifts away from the calibration distribution, so yesterday's PCA basis goes stale. We present four mechanisms shipped in the geodessical runtime that address each invariance failure individually: MCR phase-aware rank allocation with attention-sink bypass, gauge-optimal compression via diagonal $g \in \mathrm{GL}(d)$ optimisation, thermal-budget-coupled rank scaling driven by NVML with a tokens-per-joule objective term, and rejection-triggered online PCA via Oja's rule. All four are reachable from the host binary CLI and individually composable with the GP and OTT pipelines. The contribution is the combination: each mechanism targets one specific invariance failure, all four are zero- or near-zero-overhead at inference time, and none requires retraining. The first measurement sweep is the v0.2 milestone.

§1 , Why static rank fails

Four invariance failures of one-rank-fits-all

The simplest GP compression replaces every weight matrix $W$ with a rank-$r$ factorisation $W \approx U_r S_r V_r^\top$ at a single fixed $r$. This works as a baseline (Paper 1) but bakes in four assumptions, each of which the literature now contradicts.

Invariance assumed by static rank-$r$	Empirical violation	Paper 6 mechanism
All layers compress equally well at the same rank	Queipo-de-Llano et al. (Oct 2025) show three measurable phases (Mix / Compress / Refine) with order-of-magnitude variance differences	§2 MCR rank allocation
The SVD spectrum is intrinsic to $W$	The residual stream has an exact GL($d$) gauge; the spectrum of $W \cdot \mathrm{diag}(1/g)$ depends on $g$	§3 Axiom Gauge
Optimal rank is hardware-independent	Consumer GPUs thermal-throttle at $\sim 85\,^\circ\mathrm{C}$ within $\sim 30\,\mathrm{s}$ of sustained load; energy-per-token doubles in throttled regime	§4 Thermal Rank + Tokens-per-Joule
The calibration distribution matches the inference distribution	Speculative-decode rejection events are a direct measurement of distribution drift	§5 Online Basis (rejection-driven Oja)

Each row is a separate research thread in the existing literature; their combination here is not. The novelty of this paper is not any single mechanism , each has antecedents , but the orchestration: four invariances fail in different ways, each with a known fix, and all four fixes can be made zero-cost at inference time. We describe each in turn.

§2 , MCR phase-aware rank allocation

Detect the three layer phases, allocate accordingly

Queipo-de-Llano et al. (Oct 2025) showed that transformer layers process tokens in three measurable phases driven by the BOS attention sink:

Mix (early): broad token mixing, high activation variance. Compression here is destructive because it discards the cross-token representation the model is actively building.
Compress (middle): the model itself collapses attentional mass onto a small set of directions, residual-stream activation variance drops to its global minimum. Low-rank projection here is nearly free , we are replicating what the model already does.
Refine (late): selective task-specific feature extraction. Variance rises; rank matters again, but for different reasons.

A static-rank pipeline ignores the structure. MCR detects phase boundaries empirically from the per-layer activation-variance profile and allocates non-uniform rank.

2.1 , Phase detection

For each layer $\ell$, compute the per-feature variance of the captured hidden states and reduce to a scalar $\sigma_\ell^2 = \mathbb{E}[\,\|h_\ell - \bar h_\ell\|^2\,] / d$. Apply a 3-tap moving average to suppress single-layer noise, denote $\tilde\sigma^2_\ell$ the smoothed variance, and define

\sigma^2_{\min} = \min_\ell \tilde\sigma^2_\ell, \quad \mathrm{phase}(\ell) = \begin{cases} \mathrm{Compress} & \tilde\sigma^2_\ell \le c_\mathrm{thr}\,\sigma^2_{\min} \\ \mathrm{Mix} & \ell < \ell_\mathrm{compress\_start} \\ \mathrm{Refine} & \ell > \ell_\mathrm{compress\_end} \end{cases}

with default threshold $c_\mathrm{thr} = 1.5$ ("within 50% of the global variance minimum"). The implementation lives in runtime/nn/mcr_compress.h; if the variance profile is flat (no clear minimum), the result sets phases_valid = 0 and the runtime falls back to uniform rank.

2.2 , Rank allocation

Given the phase labels and a global rank budget $R = r_\mathrm{base} \cdot L$, allocate per-layer rank

r_\ell = \mathrm{clip}\!\left(s_{\mathrm{phase}(\ell)} \cdot r_\mathrm{base}, \; r_{\min}, \; r_{\max}\right)

with default scales $s_\mathrm{Mix} = 1.5$, $s_\mathrm{Compress} = 0.35$, $s_\mathrm{Refine} = 1.2$ (CLI-tunable via --axex-mcr-{mix,compress,refine}-scale). The result is then re-normalised so the total rank budget matches the user-requested $R$.

2.3 , Attention-sink bypass

Sink tokens (BOS in particular) carry residual-stream L2 norms several standard deviations above the mean. Their large activations may project badly onto the calibration PCA basis, and the resulting reconstruction error grows non-linearly past some compression ratio. StreamingLLM preserves sink tokens in the KV cache; nobody (to our knowledge) preserves the sink direction in the weight/activation PCA basis. We do.

The procedure is simple: detect items with $\|h\|_2 > \mu + 3\sigma$ in the calibration set; check whether the dominant sink direction $\hat s$ is well-covered by the existing basis $V_r$ via $\max_k |\langle \hat s, v_k\rangle|$; if not, append $\hat s$ as an extra basis vector before layer initialisation. The cost is one rank slot per affected layer; the benefit is bounded reconstruction error on sink-dominated steps.

2.4 , Closed-form prediction

If the variance profile is real, MCR redistributes rank from Compress-phase layers (where the marginal value of rank is smallest) to Mix and Refine layers. At a fixed total budget the predicted PPL improvement scales with the ratio $\sigma^2_\mathrm{Mix} / \sigma^2_\mathrm{Compress}$, which is typically $\sim 5\!\times$ in published profiles , an order-of-magnitude lever. Whether the prediction holds in practice is the v0.2 measurement.

§3 , Axiom Gauge: GL($d$) gauge-optimal compression

Exploiting the residual-stream symmetry

The transformer residual stream carries an exact gauge symmetry. For any invertible $G \in \mathrm{GL}(d)$, the substitution

x \mapsto G\,x, \quad W^\mathrm{read} \mapsto W^\mathrm{read} G^{-1}, \quad W^\mathrm{write} \mapsto G\,W^\mathrm{write}

leaves model outputs unchanged. "Read" matrices project the residual stream into a subspace (Q, K, V, FFN gate, FFN up); "write" matrices project back (attention output O, FFN down). The SVD spectrum of $W^\mathrm{read} G^{-1}$ depends on $G$ unless $G$ is orthogonal , so the $r$-rank truncation error is gauge-dependent, even though the model is not.

No static rank-$r$ scheme exploits this freedom. Axiom Gauge does. We parameterise $G = \mathrm{diag}(g)$, $g \in \mathbb{R}^d_{>0}$, and minimise the joint tail energy

\mathcal{L}(g) = \sum_{\ell,\; W \in \mathrm{reads}_\ell} \mathrm{tail}_r\!\big(W \mathrm{diag}(g^{-1})\big) + \sum_{\ell,\; W \in \mathrm{writes}_\ell} \mathrm{tail}_r\!\big(\mathrm{diag}(g)\, W\big)

where $\mathrm{tail}_r(M) = \|M\|_F^2 - \|\mathrm{trunc}_r(M)\|_F^2$ is the energy not captured by the top-$r$ SVD.

3.1 , Gradient in log space

Optimisation is done in $\lambda = \log g$ to keep $g > 0$ automatic. For a read matrix $W \in \mathbb{R}^{m \times d}$ with $X = W\,\mathrm{diag}(e^{-\lambda})$:

\frac{\partial\,\mathrm{tail}_r(X)}{\partial \lambda_i} = -2\bigl(\|X[:,i]\|^2 - \sum_{k \le r} S_k^2\,V^\top_{k,i}{}^2\bigr) = -2\,\|X_\mathrm{tail}[:,i]\|^2

For a write matrix $W \in \mathbb{R}^{d \times n}$ with $X = \mathrm{diag}(e^{\lambda})\,W$, the sign flips:

\frac{\partial\,\mathrm{tail}_r(X)}{\partial \lambda_i} = +2\,\|X_\mathrm{tail}[i,:]\|^2.

These are derived directly from $\|M\|_F^2 = \mathrm{tr}(M^\top M)$ and $\partial_{\lambda_i} \mathrm{diag}(e^\lambda) = e^{\lambda_i} E_{ii}$. The gradient is sparse per coordinate, the SVDs are computed once per outer iteration, and 10–30 outer iterations is typically sufficient. After the final $g$ is found we normalise so $\prod_i g_i^{1/d} = 1$ to remove the global scale ambiguity (the residual stream itself is unchanged on average).

3.2 , Zero-overhead inference

Once $g^\star$ is found it is baked into the compressed factors before upload to GPU. For a read matrix factored as $W \approx U_r S_r V_r^\top$, the gauge-aware factor stored as d_Vt[k][i] is $S_k V_{k,i}\,g_i$; for a write matrix, d_U[i][k] becomes $U_{i,k}/g_i$. The inference path , the two-GEMV $\mathrm{tmp} = V_r^\top x$, $\mathrm{out} = U_r\,\mathrm{tmp}$ , is completely unchanged. This is the cleanest possible deployment: a free pre-compute step at calibration time, no runtime branch, no extra memory traffic.

Implementation is in runtime/nn/axiom_gauge.h; the entry point is axex_gauge_optimize, which dequantises any GGUF-quantised model on the fly and only processes the matrices involved in SVD compression. Embedding and norm matrices are skipped because they are not compressed by GP. The reported quantity tail_after / tail_before bounds the achievable PPL gain.

3.3 , Closed-form prediction

The diagonal-gauge optimum reaches a stationary point on the $d$-dimensional manifold; the achievable tail reduction depends on the spread of feature scales across the read/write matrices. For models with LayerNorm-induced uniform feature scales the gain will be modest ($\lesssim 5\%$ tail energy); for models with strong per-feature scale asymmetry (Llama-3 with no per-token RMSNorm-pre-attention scaling, for instance) the gain may exceed $20\%$. The measurement is straightforward and is the §9 v0.2 milestone.

§4 , Thermal Rank and tokens-per-joule

Coupling rank to thermal headroom

Consumer GPUs throttle. On the reference RTX 4070 Laptop, sustained Llama-3.1-8B inference reaches steady-state $\sim 85\,^\circ\mathrm{C}$ within $\sim 30$ seconds and the driver clamps the boost clock; tok/s falls to 50–60% of the cold-start measurement. Earlier work (throttLL'eM, HPCA 2025) responds by clamping the clock proactively. We respond by clamping the compression rank instead.

4.1 , NVML-driven rank scaling

The mechanism is a linear interpolation. With temperature thresholds $T_\mathrm{low}$ and $T_\mathrm{high}$ and rank bounds $r_{\min}, r_{\max}$:

r(T) = \mathrm{clip}\!\left(r_{\max} - (r_{\max} - r_{\min}) \cdot \frac{T - T_\mathrm{low}}{T_\mathrm{high} - T_\mathrm{low}}, \; r_{\min}, r_{\max}\right).

Defaults are $T_\mathrm{low} = 65\,^\circ\mathrm{C}$, $T_\mathrm{high} = 85\,^\circ\mathrm{C}$, $r_{\min}, r_{\max}$ user-supplied. NVML is loaded dynamically (nvml.dll / libnvidia-ml.so); if NVML is unavailable, nvml_ok = 0 and thermal_get_rank returns the base rank unchanged. Polling is rate-limited to once per poll_interval_ms = 500 ms by default to avoid syscall overhead. An optional $P_\mathrm{budget}\,$ W cap further reduces rank when current draw exceeds the cap.

The contract is monotone: higher temperature implies fewer FLOPs per token implies less heat generated per token. The system is closed-loop stable as long as the heat-generation rate at $r_{\min}$ is below the cooling capacity at $T_\mathrm{high}$, which is the engineering definition of "the laptop does not melt". On hardware where this is not true, $r_{\min}$ should be set lower; this is a configuration question, not a correctness question.

4.2 , Tokens-per-joule as a diffplan objective

The differentiable rank plan of Paper 4's geo_research module minimises reconstruction error with an L1 rank penalty. That optimises accuracy at fixed rank, not accuracy at fixed energy cost. The TPJ module adds an explicit energy term to the plan-level gradient. With observation $J = P_\mathrm{NVML} / (\mathrm{tok/s})$ in joules per token and a slowly estimated coefficient $\hat c_\mathrm{rank}$ (joules per unit-rank per token), the policy-gradient contribution to the diffplan softmax is

\Delta_\mathrm{TPJ}\,\mathrm{grad}_\ell^{(r)} = \lambda \cdot \hat c_\mathrm{rank} \cdot p_\ell^{(r)} \cdot \bigl(R^{(r)} - r^\mathrm{soft}_\ell\bigr),

where $p_\ell^{(r)}$ is the softmax probability the plan currently places on rank level $r$ at layer $\ell$, $R^{(r)}$ is the rank of that level, and $r^\mathrm{soft}_\ell = \sum_r p_\ell^{(r)} R^{(r)}$ is the expected rank. This is precisely the score-function gradient of the energy penalty with respect to the softmax parameters. The regularisation weight $\lambda$ defaults to $0.005$; bootstrap from a quick power+throughput measurement sets a non-zero $\hat c_\mathrm{rank}$ before any observed data so the first diffplan step already feels the energy gradient.

The objective the runtime optimises after this term lands is therefore not just $\mathcal{L}_\mathrm{recon} + \alpha \|r\|_1$ but $\mathcal{L}_\mathrm{recon} + \alpha \|r\|_1 + \lambda\,\hat c_\mathrm{rank}\,r^\mathrm{soft}$ , a weighted blend of accuracy, headroom, and energy. Implementation in runtime/nn/thermal_rank.h.

4.3 , Closed-form prediction

On a 30 s sustained run with no thermal control, baseline tok/s drops by $\sim 40\%$ once throttling engages (Paper 1 §6). Thermal-Rank avoids the throttle altogether at the cost of a controlled rank reduction; if the rank–PPL elasticity is $\partial\mathrm{PPL}/\partial r < 1\%$ per rank-step in the operating regime (typical for $r \in [768, 1280]$ on Llama-3.1-8B), the trade is favourable. The measurement is a sustained 60-second decode comparing static $r=1024$ against thermal-controlled $r \in [768, 1024]$ on the locked protocol.

§5 , Online Basis: rejection-driven Oja

Coupling basis updates to model divergence

A static PCA basis fitted on calibration data drifts out of validity as inference moves into distributions the calibration did not cover. The drift is invisible to the runtime , until the basis becomes stale enough to cause output errors. Speculative decoding offers a free measurement of drift: every rejection is a position where the draft and verifier disagreed, which means the draft basis projected the prefix into a place the verifier disagreed with. The residual $h^\mathrm{tgt} - h^\mathrm{drft}$ is exactly the direction of basis-induced error.

5.1 , Oja's rule, decayed

Online Basis hooks the rejection path. For a layer $\ell$ with current basis $W \in \mathbb{R}^{k \times d}$ and a new sample $x = h^\mathrm{tgt} - h^\mathrm{drft}$, the Oja rank-1 update is

w_i \leftarrow w_i + \eta_t\,x\,(x^\top w_i), \quad w_i \leftarrow w_i / \|w_i\|, \qquad i = 1, \dots, k

with deflation across the $k$ rows so the result remains orthonormal. The learning rate decays as $\eta_t = \eta_0 / \sqrt{t}$ to satisfy the Robbins–Monro conditions; default $\eta_0 = 0.01$. Convergence to the leading eigenvectors of the running covariance is classical (Oja 1982). What is new here is the trigger: the only samples we feed in are the residuals from rejection events, which by construction are the directions the current basis fails to capture.

5.2 , Per-layer staleness versioning

A bump-counter onb_layer_state_t::version is incremented on every applied update. The GP path tracks the version it last consumed; if the layer's version moves ahead, the cached projection of $W^\mathrm{read}$ by $V_r^\top$ is invalidated and recomputed lazily on the next forward. Because rejections are rare (a few per generated token in a typical run) the recomputation cost is amortised to nothing on the hot path.

Updates are applied between decode steps, not inside them, via onb_apply_pending. The pending queue is bounded (ONB_QUEUE_CAP = 256); if it fills, the oldest entry is dropped. A minimum-rejection gate (min_rejections_before_update = 4) avoids reacting to single spurious rejections.

5.3 , Closed-form prediction

If the calibration distribution covers the inference distribution, online updates are a no-op. If the inference distribution is drifting linearly, the Oja-decayed estimator converges to the running covariance with bias $\propto \eta_t$; for $\eta_0 = 0.01$ and $t = 10^4$ samples the bias floor is $\sim 10^{-4}$ , well below typical PCA truncation error. Concretely we expect online basis to be invisible on Paper 1's matched calibration/test split and to show measurable PPL or acceptance-rate improvement on out-of-distribution prompts. Both are measurable on the §8 recipe.

Reference: Oja (1982); recent runtime applications include OjaKV (2025) for online adaptive PCA on the KV cache. Implementation in runtime/nn/online_basis.h.

§6 , How the four mechanisms compose

Orthogonal axes, near-zero overhead

The four mechanisms target distinct invariance failures and are architecturally orthogonal. MCR allocates rank across layers at calibration time. Axiom Gauge changes the basis the SVD is taken in at calibration time. Thermal Rank scales the chosen rank at run time as a function of hardware state. Online Basis modifies the basis matrix entries at run time as a function of model divergence.

Mechanism	Where it acts	When it acts	Inference overhead	Composition with GP/OTT
MCR	per-layer $r_\ell$ allocation	calibration	zero	changes only the rank schedule
Axiom Gauge	basis entries (factor pre-multiply)	calibration	zero	commutes with MCR; baked into factors
Thermal Rank	active $r$	run time, $\sim 0.5\,\mathrm{s}$ poll	$\mathcal{O}(1)$ NVML call	caps $r$ at $r(T)$ on top of MCR-allocated $r_\ell$
Online Basis	basis matrix $W$ entries	run time, between steps	amortised zero (queue drain)	updates the same $W$ MCR + Gauge produced at calibration

All four can be enabled simultaneously without contention. The runtime composition is: MCR sets $r_\ell$ → Gauge optimises $g$ at that schedule and bakes → Thermal Rank reads $r_\ell$ as ceiling and scales by $T$ → Online Basis incrementally adjusts the resulting $W$ matrix on rejection events. None of the four interferes with the others, by construction.

§8 , Reproducing & flag reference

How to enable each mechanism

All four mechanisms are gated by CLI flags on the host binary:

# MCR (phase-aware rank allocation + sink bypass)
.\build_host\geodessical.exe --model models\smollm2-135m-instruct-q8_0.gguf \
    --axex-compress --axex-compress-rank 1024 \
    --axex-mcr-mix-scale 1.5 \
    --axex-mcr-compress-scale 0.35 \
    --axex-mcr-refine-scale 1.2 \
    [...]

# Axiom Gauge (auto-iter chooses 10 small / 1 large)
.\build_host\geodessical.exe --model ... \
    --axex-compress --axex-compress-rank 1024 \
    --axex-gauge \
    --axex-gauge-iter 0 \
    [...]

# Thermal Rank (NVML required; falls back gracefully if unavailable)
.\build_host\geodessical.exe --model ... \
    --axex-compress --axex-compress-rank 1024 \
    --axex-thermal-low 65 \
    --axex-thermal-high 85 \
    --axex-thermal-power 0 \
    [...]

# Online Basis (only meaningful with speculative decode)
.\build_host\geodessical.exe --model ... \
    --ott-full --ott-speculative \
    --axex-online-basis \
    [...]

All four can be combined; ordering does not matter. The default values are deliberately conservative; the v0.2 sweep will tune each independently against PPL and tok/s on the locked protocol.

§9 , Status

What v0.1 has and what v0.2 will measure

Now landed (v0.1):

Implementation of all four mechanisms in runtime/nn/{mcr_compress,axiom_gauge,thermal_rank,online_basis}.{h,c}.
CLI surface in host/main.c.
Closed-form prediction stated for each mechanism in §§2.4, 3.3, 4.3, 5.3.
Composition argument in §6.

Still missing for v0.2:

MCR phase profile measurement on Llama-3.1-8B and SmolLM2-135M, with the resulting rank allocation table.
Axiom Gauge tail-energy reduction percentage on each model and a PPL sweep at fixed rank.
Thermal Rank sustained-decode measurement (60 s, locked protocol) against static rank baseline.
Online Basis acceptance-rate delta on out-of-distribution prompts in OTT speculative mode.
Combined-stack measurement: MCR + Gauge + Thermal + Online vs static GP, both PPL and tok/s.
Citation pass and full bibliography.

The publication threshold for v0.1 is implementation reality and a stated closed-form prediction for each mechanism. The threshold for v0.2 is one measurement per mechanism plus the combined stack.

§9.5 , Limitations

What v0.1 does not establish

Paper 6 is, by construction, a design paper with a stated closed-form prediction per mechanism rather than a measurement paper. The boundaries below are what the next version is expected to close.

No measurement headline. All four mechanisms are implementation-real and CLI-reachable, but none has a measured end-to-end PPL or tok/s number in this version. The §§2.4 / 3.3 / 4.3 / 5.3 predictions are derivations from the static GP baseline under explicit assumptions, not benchmarks.
Composition is asserted, not measured. §6 argues that the four mechanisms are orthogonal, MCR allocates rank, Gauge rotates the basis, Thermal scales rank with NVML signal, Online Basis updates the basis from rejection events, and therefore stack additively. The assertion is load-bearing and is the v0.2 combined-stack measurement listed in §9.
Thermal Rank depends on NVML. The tokens-per-joule policy gradient assumes a working NVML telemetry path. On hardware where NVML is absent, throttled, or coarse (some laptop SKUs, some cloud instances), the controller falls back to a static-rank schedule; the fallback is implemented but its quality is not measured.
Online Basis assumes rejection events. The Oja's-rule update is fired by speculative-decoding rejections (Paper 3 §5.5). In a non-speculative deployment the trigger does not exist and the basis is static; this is by design but means Paper 6's online claim is gated on the spec path being live.
Gauge optimisation is offline. The Axiom Gauge diagonal-$g$ search runs at build time, not during decode. "Zero inference overhead" refers specifically to the runtime forward pass; it is not a claim about the build-time cost of finding $g$, which scales with the Phase-1 cloud size.
No comparison against AWQ / GPTQ / SmoothQuant. The baseline in this paper is static Geodesic Projection (Paper 2). A head-to-head against quantisation-side adaptive schemes is open work.

None of these block the runtime from being usable today; they block the publication-grade adaptive-compression claim. v0.2 is scoped to close items 1, 2, and 3.

§9.6 , Terms

Where to find definitions

Paper 6 reuses the glossaries in Paper 1 §0.5 (rank $r$/$k$, decode vs prefill, residual stream) and Paper 2 §0.5 (PCA basis, projection slot, geometry cache, depth-sink, MCR/Ricci rank allocation). Speculative-path terms (acceptance rate $\alpha$, draft/verifier, OneDecode, OTT) are in Paper 3 §0.5. Mechanism-specific terms, phase profile, tail energy, gauge group GL($d$), tokens-per-joule, Oja's rule, are introduced inline at first use.

§10 , References

Selected refs

Oja, E., A simplified neuron model as a principal component analyzer, J. Math. Biol., 1982.
Queipo-de-Llano, P., et al., Three Phases of Transformers, Oct. 2025 (LeCun, Bronstein, co-authors).
Xiao, G. et al., Efficient Streaming Language Models with Attention Sinks (StreamingLLM), ICLR 2024.
OjaKV authors, Online Adaptive PCA for KV Cache via Oja's Rule, Sep. 2025.
throttLL'eM authors, Adaptive Frequency Scaling for LLM Inference, HPCA 2025.
RAP authors, Reinforcement-Learning Adaptive Pruning per Request, 2024.
Stewart, W. K. O., Compressing Llama-3.1-8B Attention via Per-Slot SVD, this site, Paper 1, 2026.
Stewart, W. K. O., Geodesic Projection: GP-Compressed Llama Runtime, this site, Paper 2, 2026.