This paper documents the design and runtime implementation of four adaptive-compression mechanisms layered on top of the static GP/GRC compression of Paper 1 and the geodesic-projection runtime of Paper 2:
- MCR , Mix / Compress / Refine: per-layer rank allocation driven by an empirical phase profile of the residual stream plus an attention-sink bypass.
- Axiom Gauge: gauge-optimal compression via the GL($d$) residual-stream symmetry, baked into the compressed weights at zero inference overhead.
- Thermal Rank + Tokens-per-Joule: NVML-driven rank scaling against thermal headroom, plus an energy-aware policy gradient added to the differentiable rank plan.
- Online Basis: Oja's-rule rank-1 PCA updates fired by speculative-decode rejection events, with per-layer staleness versioning.
What this paper does not contain: measured PPL or tok/s sweeps for any of the four mechanisms in isolation or in combination. Each mechanism is implemented and reachable from the host binary via a CLI flag; a sweep over (rank, scale, $\eta_0$, temp threshold) on the Paper 1 locked protocol is the obvious next milestone but is not in this revision. The publication threshold is the same as Paper 3 v0.2: real implementation, reviewable today, no fabricated empirical headline. Where a closed-form prediction can be stated, it is.
Abstract
Static rank-$r$ SVD compression of a transformer treats every layer, every
weight matrix, and every operating point as exchangeable. None of those
invariances actually hold: layers fall into one of three empirical phases
(mix, compress, refine) with different sensitivity to rank reduction; the
residual stream carries an exact GL($d$) gauge symmetry that no static SVD
exploits; consumer hardware enters thermal throttling within $30$ seconds
of sustained load and the optimal rank changes accordingly; and the
inference distribution drifts away from the calibration distribution,
so yesterday's PCA basis goes stale. We present four mechanisms shipped
in the geodessical runtime that address each invariance
failure individually: MCR phase-aware rank allocation with attention-sink
bypass, gauge-optimal compression via diagonal $g \in \mathrm{GL}(d)$
optimisation, thermal-budget-coupled rank scaling driven by NVML with a
tokens-per-joule objective term, and rejection-triggered online PCA via
Oja's rule. All four are reachable from the host binary CLI and individually
composable with the GP and OTT pipelines. The contribution is the
combination: each mechanism targets one specific invariance
failure, all four are zero- or near-zero-overhead at inference time, and
none requires retraining. The first measurement sweep is the v0.2
milestone.
Four invariance failures of one-rank-fits-all
The simplest GP compression replaces every weight matrix $W$ with a rank-$r$ factorisation $W \approx U_r S_r V_r^\top$ at a single fixed $r$. This works as a baseline (Paper 1) but bakes in four assumptions, each of which the literature now contradicts.
| Invariance assumed by static rank-$r$ | Empirical violation | Paper 6 mechanism |
|---|---|---|
| All layers compress equally well at the same rank | Queipo-de-Llano et al. (Oct 2025) show three measurable phases (Mix / Compress / Refine) with order-of-magnitude variance differences | §2 MCR rank allocation |
| The SVD spectrum is intrinsic to $W$ | The residual stream has an exact GL($d$) gauge; the spectrum of $W \cdot \mathrm{diag}(1/g)$ depends on $g$ | §3 Axiom Gauge |
| Optimal rank is hardware-independent | Consumer GPUs thermal-throttle at $\sim 85\,^\circ\mathrm{C}$ within $\sim 30\,\mathrm{s}$ of sustained load; energy-per-token doubles in throttled regime | §4 Thermal Rank + Tokens-per-Joule |
| The calibration distribution matches the inference distribution | Speculative-decode rejection events are a direct measurement of distribution drift | §5 Online Basis (rejection-driven Oja) |
Each row is a separate research thread in the existing literature; their combination here is not. The novelty of this paper is not any single mechanism , each has antecedents , but the orchestration: four invariances fail in different ways, each with a known fix, and all four fixes can be made zero-cost at inference time. We describe each in turn.
Detect the three layer phases, allocate accordingly
Queipo-de-Llano et al. (Oct 2025) showed that transformer layers process tokens in three measurable phases driven by the BOS attention sink:
- Mix (early): broad token mixing, high activation variance. Compression here is destructive because it discards the cross-token representation the model is actively building.
- Compress (middle): the model itself collapses attentional mass onto a small set of directions, residual-stream activation variance drops to its global minimum. Low-rank projection here is nearly free , we are replicating what the model already does.
- Refine (late): selective task-specific feature extraction. Variance rises; rank matters again, but for different reasons.
A static-rank pipeline ignores the structure. MCR detects phase boundaries empirically from the per-layer activation-variance profile and allocates non-uniform rank.
2.1 , Phase detection
For each layer $\ell$, compute the per-feature variance of the captured hidden states and reduce to a scalar $\sigma_\ell^2 = \mathbb{E}[\,\|h_\ell - \bar h_\ell\|^2\,] / d$. Apply a 3-tap moving average to suppress single-layer noise, denote $\tilde\sigma^2_\ell$ the smoothed variance, and define
with default threshold $c_\mathrm{thr} = 1.5$ ("within 50% of the global
variance minimum"). The implementation lives in runtime/nn/mcr_compress.h;
if the variance profile is flat (no clear minimum), the result sets
phases_valid = 0 and the runtime falls back to uniform rank.
2.2 , Rank allocation
Given the phase labels and a global rank budget $R = r_\mathrm{base} \cdot L$, allocate per-layer rank
with default scales $s_\mathrm{Mix} = 1.5$, $s_\mathrm{Compress} = 0.35$,
$s_\mathrm{Refine} = 1.2$ (CLI-tunable via
--axex-mcr-{mix,compress,refine}-scale). The result is then
re-normalised so the total rank budget matches the user-requested $R$.
2.3 , Attention-sink bypass
Sink tokens (BOS in particular) carry residual-stream L2 norms several
standard deviations above the mean. Their large activations may project
badly onto the calibration PCA basis, and the resulting reconstruction
error grows non-linearly past some compression ratio.
StreamingLLM preserves sink tokens in the KV cache; nobody (to
our knowledge) preserves the sink direction in the weight/activation PCA
basis. We do.
The procedure is simple: detect items with $\|h\|_2 > \mu + 3\sigma$ in the calibration set; check whether the dominant sink direction $\hat s$ is well-covered by the existing basis $V_r$ via $\max_k |\langle \hat s, v_k\rangle|$; if not, append $\hat s$ as an extra basis vector before layer initialisation. The cost is one rank slot per affected layer; the benefit is bounded reconstruction error on sink-dominated steps.
2.4 , Closed-form prediction
If the variance profile is real, MCR redistributes rank from Compress-phase layers (where the marginal value of rank is smallest) to Mix and Refine layers. At a fixed total budget the predicted PPL improvement scales with the ratio $\sigma^2_\mathrm{Mix} / \sigma^2_\mathrm{Compress}$, which is typically $\sim 5\!\times$ in published profiles , an order-of-magnitude lever. Whether the prediction holds in practice is the v0.2 measurement.
Exploiting the residual-stream symmetry
The transformer residual stream carries an exact gauge symmetry. For any invertible $G \in \mathrm{GL}(d)$, the substitution
leaves model outputs unchanged. "Read" matrices project the residual stream into a subspace (Q, K, V, FFN gate, FFN up); "write" matrices project back (attention output O, FFN down). The SVD spectrum of $W^\mathrm{read} G^{-1}$ depends on $G$ unless $G$ is orthogonal , so the $r$-rank truncation error is gauge-dependent, even though the model is not.
No static rank-$r$ scheme exploits this freedom. Axiom Gauge does. We parameterise $G = \mathrm{diag}(g)$, $g \in \mathbb{R}^d_{>0}$, and minimise the joint tail energy
where $\mathrm{tail}_r(M) = \|M\|_F^2 - \|\mathrm{trunc}_r(M)\|_F^2$ is the energy not captured by the top-$r$ SVD.
3.1 , Gradient in log space
Optimisation is done in $\lambda = \log g$ to keep $g > 0$ automatic. For a read matrix $W \in \mathbb{R}^{m \times d}$ with $X = W\,\mathrm{diag}(e^{-\lambda})$:
For a write matrix $W \in \mathbb{R}^{d \times n}$ with $X = \mathrm{diag}(e^{\lambda})\,W$, the sign flips:
These are derived directly from $\|M\|_F^2 = \mathrm{tr}(M^\top M)$ and $\partial_{\lambda_i} \mathrm{diag}(e^\lambda) = e^{\lambda_i} E_{ii}$. The gradient is sparse per coordinate, the SVDs are computed once per outer iteration, and 10–30 outer iterations is typically sufficient. After the final $g$ is found we normalise so $\prod_i g_i^{1/d} = 1$ to remove the global scale ambiguity (the residual stream itself is unchanged on average).
3.2 , Zero-overhead inference
Once $g^\star$ is found it is baked into the compressed factors before
upload to GPU. For a read matrix factored as $W \approx U_r S_r V_r^\top$,
the gauge-aware factor stored as d_Vt[k][i] is
$S_k V_{k,i}\,g_i$; for a write matrix, d_U[i][k] becomes
$U_{i,k}/g_i$. The inference path , the two-GEMV
$\mathrm{tmp} = V_r^\top x$, $\mathrm{out} = U_r\,\mathrm{tmp}$ , is
completely unchanged. This is the cleanest possible deployment: a free
pre-compute step at calibration time, no runtime branch, no extra memory
traffic.
Implementation is in runtime/nn/axiom_gauge.h;
the entry point is axex_gauge_optimize, which dequantises any
GGUF-quantised model on the fly and only processes the matrices involved
in SVD compression. Embedding and norm matrices are skipped because they
are not compressed by GP. The reported quantity
tail_after / tail_before bounds the achievable PPL gain.
3.3 , Closed-form prediction
The diagonal-gauge optimum reaches a stationary point on the $d$-dimensional manifold; the achievable tail reduction depends on the spread of feature scales across the read/write matrices. For models with LayerNorm-induced uniform feature scales the gain will be modest ($\lesssim 5\%$ tail energy); for models with strong per-feature scale asymmetry (Llama-3 with no per-token RMSNorm-pre-attention scaling, for instance) the gain may exceed $20\%$. The measurement is straightforward and is the §9 v0.2 milestone.
Coupling rank to thermal headroom
Consumer GPUs throttle. On the reference RTX 4070 Laptop, sustained Llama-3.1-8B inference reaches steady-state $\sim 85\,^\circ\mathrm{C}$ within $\sim 30$ seconds and the driver clamps the boost clock; tok/s falls to 50–60% of the cold-start measurement. Earlier work (throttLL'eM, HPCA 2025) responds by clamping the clock proactively. We respond by clamping the compression rank instead.
4.1 , NVML-driven rank scaling
The mechanism is a linear interpolation. With temperature thresholds $T_\mathrm{low}$ and $T_\mathrm{high}$ and rank bounds $r_{\min}, r_{\max}$:
Defaults are $T_\mathrm{low} = 65\,^\circ\mathrm{C}$,
$T_\mathrm{high} = 85\,^\circ\mathrm{C}$, $r_{\min}, r_{\max}$ user-supplied.
NVML is loaded dynamically (nvml.dll / libnvidia-ml.so);
if NVML is unavailable, nvml_ok = 0 and
thermal_get_rank returns the base rank unchanged. Polling is
rate-limited to once per poll_interval_ms = 500 ms by
default to avoid syscall overhead. An optional
$P_\mathrm{budget}\,$ W cap further reduces rank when current draw
exceeds the cap.
The contract is monotone: higher temperature implies fewer FLOPs per token implies less heat generated per token. The system is closed-loop stable as long as the heat-generation rate at $r_{\min}$ is below the cooling capacity at $T_\mathrm{high}$, which is the engineering definition of "the laptop does not melt". On hardware where this is not true, $r_{\min}$ should be set lower; this is a configuration question, not a correctness question.
4.2 , Tokens-per-joule as a diffplan objective
The differentiable rank plan of Paper 4's
geo_research module minimises reconstruction error with an L1
rank penalty. That optimises accuracy at fixed rank, not accuracy at fixed
energy cost. The TPJ module adds an explicit energy term to the
plan-level gradient. With observation
$J = P_\mathrm{NVML} / (\mathrm{tok/s})$ in joules per token and a slowly
estimated coefficient $\hat c_\mathrm{rank}$ (joules per unit-rank per
token), the policy-gradient contribution to the diffplan softmax is
where $p_\ell^{(r)}$ is the softmax probability the plan currently places on rank level $r$ at layer $\ell$, $R^{(r)}$ is the rank of that level, and $r^\mathrm{soft}_\ell = \sum_r p_\ell^{(r)} R^{(r)}$ is the expected rank. This is precisely the score-function gradient of the energy penalty with respect to the softmax parameters. The regularisation weight $\lambda$ defaults to $0.005$; bootstrap from a quick power+throughput measurement sets a non-zero $\hat c_\mathrm{rank}$ before any observed data so the first diffplan step already feels the energy gradient.
The objective the runtime optimises after this term lands is therefore
not just $\mathcal{L}_\mathrm{recon} + \alpha \|r\|_1$ but
$\mathcal{L}_\mathrm{recon} + \alpha \|r\|_1 + \lambda\,\hat c_\mathrm{rank}\,r^\mathrm{soft}$
, a weighted blend of accuracy, headroom, and energy. Implementation
in runtime/nn/thermal_rank.h.
4.3 , Closed-form prediction
On a 30 s sustained run with no thermal control, baseline tok/s drops by $\sim 40\%$ once throttling engages (Paper 1 §6). Thermal-Rank avoids the throttle altogether at the cost of a controlled rank reduction; if the rank–PPL elasticity is $\partial\mathrm{PPL}/\partial r < 1\%$ per rank-step in the operating regime (typical for $r \in [768, 1280]$ on Llama-3.1-8B), the trade is favourable. The measurement is a sustained 60-second decode comparing static $r=1024$ against thermal-controlled $r \in [768, 1024]$ on the locked protocol.
Coupling basis updates to model divergence
A static PCA basis fitted on calibration data drifts out of validity as inference moves into distributions the calibration did not cover. The drift is invisible to the runtime , until the basis becomes stale enough to cause output errors. Speculative decoding offers a free measurement of drift: every rejection is a position where the draft and verifier disagreed, which means the draft basis projected the prefix into a place the verifier disagreed with. The residual $h^\mathrm{tgt} - h^\mathrm{drft}$ is exactly the direction of basis-induced error.
5.1 , Oja's rule, decayed
Online Basis hooks the rejection path. For a layer $\ell$ with current basis $W \in \mathbb{R}^{k \times d}$ and a new sample $x = h^\mathrm{tgt} - h^\mathrm{drft}$, the Oja rank-1 update is
with deflation across the $k$ rows so the result remains orthonormal. The learning rate decays as $\eta_t = \eta_0 / \sqrt{t}$ to satisfy the Robbins–Monro conditions; default $\eta_0 = 0.01$. Convergence to the leading eigenvectors of the running covariance is classical (Oja 1982). What is new here is the trigger: the only samples we feed in are the residuals from rejection events, which by construction are the directions the current basis fails to capture.
5.2 , Per-layer staleness versioning
A bump-counter onb_layer_state_t::version is incremented on
every applied update. The GP path tracks the version it last consumed; if
the layer's version moves ahead, the cached projection of $W^\mathrm{read}$
by $V_r^\top$ is invalidated and recomputed lazily on the next forward.
Because rejections are rare (a few per generated token in a typical run)
the recomputation cost is amortised to nothing on the hot path.
Updates are applied between decode steps, not inside them, via
onb_apply_pending. The pending queue is bounded
(ONB_QUEUE_CAP = 256); if it fills, the oldest entry is
dropped. A minimum-rejection gate
(min_rejections_before_update = 4) avoids reacting to single
spurious rejections.
5.3 , Closed-form prediction
If the calibration distribution covers the inference distribution, online updates are a no-op. If the inference distribution is drifting linearly, the Oja-decayed estimator converges to the running covariance with bias $\propto \eta_t$; for $\eta_0 = 0.01$ and $t = 10^4$ samples the bias floor is $\sim 10^{-4}$ , well below typical PCA truncation error. Concretely we expect online basis to be invisible on Paper 1's matched calibration/test split and to show measurable PPL or acceptance-rate improvement on out-of-distribution prompts. Both are measurable on the §8 recipe.
Reference: Oja (1982); recent runtime applications include OjaKV (2025)
for online adaptive PCA on the KV cache. Implementation in runtime/nn/online_basis.h.
Orthogonal axes, near-zero overhead
The four mechanisms target distinct invariance failures and are architecturally orthogonal. MCR allocates rank across layers at calibration time. Axiom Gauge changes the basis the SVD is taken in at calibration time. Thermal Rank scales the chosen rank at run time as a function of hardware state. Online Basis modifies the basis matrix entries at run time as a function of model divergence.
| Mechanism | Where it acts | When it acts | Inference overhead | Composition with GP/OTT |
|---|---|---|---|---|
| MCR | per-layer $r_\ell$ allocation | calibration | zero | changes only the rank schedule |
| Axiom Gauge | basis entries (factor pre-multiply) | calibration | zero | commutes with MCR; baked into factors |
| Thermal Rank | active $r$ | run time, $\sim 0.5\,\mathrm{s}$ poll | $\mathcal{O}(1)$ NVML call | caps $r$ at $r(T)$ on top of MCR-allocated $r_\ell$ |
| Online Basis | basis matrix $W$ entries | run time, between steps | amortised zero (queue drain) | updates the same $W$ MCR + Gauge produced at calibration |
All four can be enabled simultaneously without contention. The runtime composition is: MCR sets $r_\ell$ → Gauge optimises $g$ at that schedule and bakes → Thermal Rank reads $r_\ell$ as ceiling and scales by $T$ → Online Basis incrementally adjusts the resulting $W$ matrix on rejection events. None of the four interferes with the others, by construction.
How to enable each mechanism
All four mechanisms are gated by CLI flags on the host binary:
# MCR (phase-aware rank allocation + sink bypass)
.\build_host\geodessical.exe --model models\smollm2-135m-instruct-q8_0.gguf \
--axex-compress --axex-compress-rank 1024 \
--axex-mcr-mix-scale 1.5 \
--axex-mcr-compress-scale 0.35 \
--axex-mcr-refine-scale 1.2 \
[...]
# Axiom Gauge (auto-iter chooses 10 small / 1 large)
.\build_host\geodessical.exe --model ... \
--axex-compress --axex-compress-rank 1024 \
--axex-gauge \
--axex-gauge-iter 0 \
[...]
# Thermal Rank (NVML required; falls back gracefully if unavailable)
.\build_host\geodessical.exe --model ... \
--axex-compress --axex-compress-rank 1024 \
--axex-thermal-low 65 \
--axex-thermal-high 85 \
--axex-thermal-power 0 \
[...]
# Online Basis (only meaningful with speculative decode)
.\build_host\geodessical.exe --model ... \
--ott-full --ott-speculative \
--axex-online-basis \
[...]
All four can be combined; ordering does not matter. The default values are deliberately conservative; the v0.2 sweep will tune each independently against PPL and tok/s on the locked protocol.
What v0.1 has and what v0.2 will measure
Now landed (v0.1):
- Implementation of all four mechanisms in
runtime/nn/{mcr_compress,axiom_gauge,thermal_rank,online_basis}.{h,c}. - CLI surface in
host/main.c. - Closed-form prediction stated for each mechanism in §§2.4, 3.3, 4.3, 5.3.
- Composition argument in §6.
Still missing for v0.2:
- MCR phase profile measurement on Llama-3.1-8B and SmolLM2-135M, with the resulting rank allocation table.
- Axiom Gauge tail-energy reduction percentage on each model and a PPL sweep at fixed rank.
- Thermal Rank sustained-decode measurement (60 s, locked protocol) against static rank baseline.
- Online Basis acceptance-rate delta on out-of-distribution prompts in OTT speculative mode.
- Combined-stack measurement: MCR + Gauge + Thermal + Online vs static GP, both PPL and tok/s.
- Citation pass and full bibliography.
The publication threshold for v0.1 is implementation reality and a stated closed-form prediction for each mechanism. The threshold for v0.2 is one measurement per mechanism plus the combined stack.
What v0.1 does not establish
Paper 6 is, by construction, a design paper with a stated closed-form prediction per mechanism rather than a measurement paper. The boundaries below are what the next version is expected to close.
- No measurement headline. All four mechanisms are implementation-real and CLI-reachable, but none has a measured end-to-end PPL or tok/s number in this version. The §§2.4 / 3.3 / 4.3 / 5.3 predictions are derivations from the static GP baseline under explicit assumptions, not benchmarks.
- Composition is asserted, not measured. §6 argues that the four mechanisms are orthogonal, MCR allocates rank, Gauge rotates the basis, Thermal scales rank with NVML signal, Online Basis updates the basis from rejection events, and therefore stack additively. The assertion is load-bearing and is the v0.2 combined-stack measurement listed in §9.
- Thermal Rank depends on NVML. The tokens-per-joule policy gradient assumes a working NVML telemetry path. On hardware where NVML is absent, throttled, or coarse (some laptop SKUs, some cloud instances), the controller falls back to a static-rank schedule; the fallback is implemented but its quality is not measured.
- Online Basis assumes rejection events. The Oja's-rule update is fired by speculative-decoding rejections (Paper 3 §5.5). In a non-speculative deployment the trigger does not exist and the basis is static; this is by design but means Paper 6's online claim is gated on the spec path being live.
- Gauge optimisation is offline. The Axiom Gauge diagonal-$g$ search runs at build time, not during decode. "Zero inference overhead" refers specifically to the runtime forward pass; it is not a claim about the build-time cost of finding $g$, which scales with the Phase-1 cloud size.
- No comparison against AWQ / GPTQ / SmoothQuant. The baseline in this paper is static Geodesic Projection (Paper 2). A head-to-head against quantisation-side adaptive schemes is open work.
None of these block the runtime from being usable today; they block the publication-grade adaptive-compression claim. v0.2 is scoped to close items 1, 2, and 3.
Where to find definitions
Paper 6 reuses the glossaries in Paper 1 §0.5 (rank $r$/$k$, decode vs prefill, residual stream) and Paper 2 §0.5 (PCA basis, projection slot, geometry cache, depth-sink, MCR/Ricci rank allocation). Speculative-path terms (acceptance rate $\alpha$, draft/verifier, OneDecode, OTT) are in Paper 3 §0.5. Mechanism-specific terms, phase profile, tail energy, gauge group GL($d$), tokens-per-joule, Oja's rule, are introduced inline at first use.
Selected refs
- Oja, E., A simplified neuron model as a principal component analyzer, J. Math. Biol., 1982.
- Queipo-de-Llano, P., et al., Three Phases of Transformers, Oct. 2025 (LeCun, Bronstein, co-authors).
- Xiao, G. et al., Efficient Streaming Language Models with Attention Sinks (StreamingLLM), ICLR 2024.
- OjaKV authors, Online Adaptive PCA for KV Cache via Oja's Rule, Sep. 2025.
- throttLL'eM authors, Adaptive Frequency Scaling for LLM Inference, HPCA 2025.
- RAP authors, Reinforcement-Learning Adaptive Pruning per Request, 2024.
- Stewart, W. K. O., Compressing Llama-3.1-8B Attention via Per-Slot SVD, this site, Paper 1, 2026.
- Stewart, W. K. O., Geodesic Projection: GP-Compressed Llama Runtime, this site, Paper 2, 2026.