Geodesic Projection: A Production Compression Pipeline, HyperTensor

Scope of this paper

Paper 1 isolates one design choice and reports an end-to-end measurement on one model. This paper describes the full compression pipeline that the geodessical runtime implements and the cross-architecture intrinsic-dimensionality evidence that motivates it. Two things this paper does not do, and the reader should treat them as scope limits:

It does not provide multi-model, end-to-end perplexity and throughput sweeps. The empirical anchor remains the Llama-3.1-8B-Instruct Q4_K_M number set from Paper 1.
It does not cover the 70B model. The runtime supports it; the measurement pass is queued for EC2 (compute approved) and the pack is not yet in the repository. Numbers will appear in v0.3.

The phrase used throughout to describe what is not measured is design‑validated: the code path exists, runs without error, and is consistent with the implemented behaviour of the components it composes, but no end-to-end benchmark has produced numbers for it.

§0, Abstract

Abstract

Geodesic Projection (GP) is the multi-slot, per-layer attention and FFN compression scheme implemented in the geodessical C11 runtime. It extends the calibration-free attention compression of Paper 1 along three axes: per-matrix slot coverage ($Q$, $K$, $V$, $O$, FFN up, FFN gate, FFN down); per-layer rank selection driven by a manifold-curvature heuristic with a hard floor; and a dedicated SVD-based path for FFN down, whose singular-value spectrum, like those of the other FFN slots, is much flatter than the attention slots' (§3). Two engineering pieces make GP usable in practice: (a) a persistent geometry cache that drops startup from minutes to seconds on repeat runs, and (b) a depth-sink shortcut that lets the cache be revalidated from a single layer without re-running the full axiom-discovery pass. The cross-architecture observation that motivates the whole pipeline is that the local activation manifold of a trained transformer is roughly $11$ to $25$-dimensional regardless of the ambient $d \in \{576, 1536, 3072\}$. It reproduces on three open-weight models (SmolLM2-135M, Gemma-4-E2B, Phi-3.5-mini) using the manifold-extraction pipeline checked in under legacy/axiom_vis/.

§0.5, Glossary

Terms

Term	Definition
GP (in-house)	Geodesic Projection. The full compression pipeline of this paper. Generalises the attention-only scheme of Paper 1 to all seven weight matrices in a transformer block.
Slot (in-house)	One of the seven compressible weight matrices per transformer block: $Q$, $K$, $V$, $O$ (attention) and $W_\text{up}$, $W_\text{gate}$, $W_\text{down}$ (FFN). Each gets its own basis $U^{(\ell, s)}$.
$U^{(\ell, s)}$	The per-layer-per-slot orthonormal basis. Built from the dominant eigenvectors of $W^{(\ell, s)} (W^{(\ell, s)})^\top$ (or the right singular vectors for FFN down).
Intrinsic dimension	The dimensionality of the lowest-dimensional manifold on which the model's activations approximately lie. Estimated here by retaining the smallest $k$ such that PCA on a set of activation samples explains $\geq 95\%$ of the variance. See refs [2, 7].
Geometry cache (in-house)	The on-disk artefact (`ott_geometry.bin`) holding the manifold-extraction output: per-layer eigenspectra, axiom set, depth-sink layer, intrinsic-dim estimate, and an integrity hash.
$W_\text{proj}$ cache	The on-disk artefact holding the projected weights $(W^{(\ell, s)})^\top U^{(\ell, s)} \in \mathbb{R}^{\text{out-dim} \times k}$. Mapped into VRAM at load time. Distinct from the geometry cache, which holds bases, not projected weights.
Depth-sink (in-house)	The single layer at which the residual-stream effective dimensionality saturates. Its eigenspectrum dominates the cache validity check, so the cache can be revalidated by reading the geometry of one layer instead of all of them.
MCR / Ricci heuristic (in-house)	Manifold-Curvature-Ratio: a per-layer scalar derived from the local Ricci-style curvature of the activation manifold. Used to allocate larger ranks to layers with steeper local geometry. Capped at $k = \texttt{AXEX_MANIFOLD_K_MAX}$ above and at a per-model floor below.
FFN down	The matrix $W_\text{down} \in \mathbb{R}^{d \times d_\text{ffn}}$ that contracts the FFN intermediate space ($d_\text{ffn} = 14{,}336$ for Llama-3.1-8B) back to model dim $d = 4096$. Its singular-value spectrum is much flatter than the attention slots, so it gets a dedicated SVD path rather than the eigendecomposition route.
AXEX (in-house)	The runtime flag prefix for the GP machinery. `--axex-compress`, `--axex-attn-only`, `--axex-skip-o`, `--axex-weight-pca`, `--axex-compress-rank N`, `--axex-kv`. Defined in runtime/nn/axiom_exploit.h.

§1, From Paper 1 to GP

What Paper 1 measures, what GP generalises

Paper 1 presents a deliberately narrow design: shared rank $k$ across layers, attention slots only ($Q$, $K$, $V$; $O$ skipped), no FFN compression, no quality fine-tune. The narrowness is the point: it lets the calibration-free claim and the super-baseline claim be tested without confounding from other compression knobs.

The runtime implements a more general scheme. The shape of the generalisation is summarised below. Each row is a knob in the production pipeline that Paper 1 holds fixed.

Knob	Paper 1 setting	GP setting
Slots compressed	$Q$, $K$, $V$ only	$Q$, $K$, $V$, $O$, FFN up, FFN gate, FFN down (configurable)
Rank $k$ across layers	Shared	Per-layer (MCR/Ricci-driven, with a $k$ floor)
Decomposition	Eigendecomposition of $W W^\top$	Same for $Q/K/V/O$ and FFN up/gate; SVD for FFN down
Build cost	Paid every run	Paid once; cached in `ott_geometry.bin` + `W_proj`
Cache validation	n/a	Depth-sink-layer eigenspectrum hash + weight-blob hash
KV-cache projection	Off	Optional (`--axex-kv`)

Paper 1's measurement uses the GP code path with all but the first knob held in the simplest position. The numbers reported there are therefore GP numbers in the limit where GP collapses to attention-only fixed-rank. What GP adds beyond that is (a) coverage of the FFN, which is where the bulk of Llama-3.1-8B's parameters live (FFN ≈ 70% of weights in this architecture), and (b) the cache architecture that makes the build cost a one-time charge instead of a per-run charge.
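The "FFN ≈ 70% of weights" figure can be checked from the matrix shapes listed in §3. The sketch below is an illustrative count, not output from the runtime. The vocabulary size (128,256) and the untied input/output embeddings are assumptions taken from the public Llama-3.1-8B configuration and are not stated in this paper.

```python
# Parameter-count sketch for Llama-3.1-8B (weight matrices and norm gains only).
D, D_KV, D_FFN, LAYERS = 4096, 1024, 14336, 32   # shapes from §3
VOCAB = 128_256                                  # assumption: not stated in this paper

attn_per_layer = 2 * D * D + 2 * D_KV * D        # Q and O, plus K and V under GQA
ffn_per_layer = 3 * D_FFN * D                    # W_up, W_gate, W_down
norms = LAYERS * 2 * D + D                       # two RMSNorm gains per block + final norm
embeddings = 2 * VOCAB * D                       # assumption: untied input and output embeddings

attn_total = LAYERS * attn_per_layer
ffn_total = LAYERS * ffn_per_layer
total = attn_total + ffn_total + norms + embeddings

ffn_share = ffn_total / total                              # share of all weights
ffn_share_of_blocks = ffn_total / (attn_total + ffn_total) # share of the seven slots only

print(f"total parameters      : {total:,}")
print(f"FFN share of all      : {ffn_share:.1%}")
print(f"FFN share of the slots: {ffn_share_of_blocks:.1%}")
```

Under these assumptions the FFN accounts for about 70% of all weights and about 81% of the weights in the seven compressible slots, which is why attention-only compression leaves most of the model untouched.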

The honest position on FFN compression is that the runtime can compress it but that Paper 1 deliberately did not because the FFN's singular-value spectrum on Llama-3.1-8B is dramatically flatter than the attention slots'. We present the spectrum evidence in §3 and the consequence for compression quality in §7.

§2, The cross-architecture manifold evidence

Why a low-rank basis exists at all

The premise of GP is that the local activation manifold of a trained transformer has intrinsic dimension much smaller than the ambient model dim $d$. If that premise is false, low-rank weight projection should be uniformly destructive. The manifold-extraction pipeline checked in under legacy/axiom_vis/ runs four phases on a model (manifold sampling, symmetry detection, curvature estimation, and axiom-set extraction) and emits per-phase JSON. We summarise the Phase 1 (intrinsic-dimensionality) and Phase 4 (axiom-set) outputs below for three open-weight models.

Model	$d$	Intrinsic dim (95% var)	$k_\text{int}/d$	Samples	Axiom set size	Consistency
SmolLM2-135M	576	17	2.95%	64	24 / 96	0.921
Gemma-4-E2B	1,536	25	1.63%	64	23 / 92	0.961
Phi-3.5-mini	3,072	11	0.36%	64	22 / 96	0.959

Source data: phase1_manifold.json and phase4_axioms.json for each model under legacy/axiom_vis/<model>/. "Consistency" is the Phase 4 self-consistency score: fraction of the discovered axiom set that survives held-out rebuild on a disjoint sample.
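The intrinsic-dimension column follows the glossary definition: the smallest $k$ such that PCA explains at least 95% of the variance of the activation samples. A minimal NumPy sketch of that estimator is given here for orientation. It is not the Phase 1 code from legacy/axiom_vis/, and the synthetic "activations" are a stand-in chosen so that the answer is bounded by construction.

```python
import numpy as np

def intrinsic_dim(samples: np.ndarray, threshold: float = 0.95) -> int:
    """Smallest k whose top-k principal components explain >= threshold of the variance."""
    centred = samples - samples.mean(axis=0, keepdims=True)
    # Squared singular values of the centred data are proportional to the PCA variances.
    sigma = np.linalg.svd(centred, compute_uv=False)
    explained = np.cumsum(sigma**2) / np.sum(sigma**2)
    k = int(np.searchsorted(explained, threshold)) + 1
    return min(k, explained.size)

# Synthetic stand-in for activations (an assumption, not model data): 64 samples
# in d = 576 that lie on a 17-dimensional linear subspace plus small isotropic noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((64, 17))
mixing = rng.standard_normal((17, 576))
activations = latent @ mixing + 1e-3 * rng.standard_normal((64, 576))

k_int = intrinsic_dim(activations)
print(f"k_int = {k_int}, k_int / d = {k_int / 576:.2%}")
```

With only 64 samples the estimate can never exceed 63, so the sample count itself bounds what this estimator can report; that is worth keeping in mind when reading the table.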

What this is and is not

What it is: three independent reproductions of the "low-intrinsic-dimensionality of LLM activations" finding [refs 1, 2, 7] on models that span a factor of about five in $d$ (576 to 3,072) and three architectures (Llama-style decoder, Gemma-style hybrid, Phi-style decoder). All three converge on $k_\text{int} \in \{11, 17, 25\}$, that is, $k_\text{int} \ll d$ uniformly, and $k_\text{int}$ does not grow with $d$.

What it is not: a guarantee that weight-space PCA of the individual $Q/K/V/O$ matrices captures this same intrinsic dimension. Activation-space intrinsic dim is necessary for the premise but not sufficient for the GP construction. The actual weight-space evidence (singular-value spectra of all seven slots on Llama-3.1-8B) is in §3: it strongly supports the GP construction for the four attention slots and quantitatively contradicts it for the three FFN slots, which is exactly the asymmetry Paper 1 leans on.

§3, The seven slots and their spectra

Why FFN-down gets a different code path

The seven compressible matrices in a Llama-style transformer block, with shapes for Llama-3.1-8B ($d = 4{,}096$, $d_\text{ffn} = 14{,}336$, $h = 32$ heads):

Attention   Q          : [d, d]            =  [4096, 4096]
            K, V       : [d_kv, d]         =  [1024, 4096]   (GQA: 8 KV heads x 128)
            O          : [d, d]            =  [4096, 4096]
FFN         W_up       : [d_ffn, d]        =  [14336, 4096]
            W_gate     : [d_ffn, d]        =  [14336, 4096]
            W_down     : [d, d_ffn]        =  [4096, 14336]

Note: Llama-3.1-8B uses grouped-query attention, so $K$ and $V$ have only $d_\text{kv} = 1{,}024$ output rows (not $d = 4{,}096$). This makes $k = 1{,}024$ a natural reference rank: on $K$ and $V$ the projection becomes lossless (a matrix cannot have more non-zero singular values than its smaller dimension), so the only quality cost at $k = 1{,}024$ comes from $Q$ and (when included) $O$. §6 shows that this remaining cost is substantial.

The attention slots, plus $W_\text{up}$ and $W_\text{gate}$, share the model dimension $d$ as the dimension being compressed (the matrix may be tall or wide; what matters is that the basis is built over the compressed dimension). For these slots GP uses the same construction as Paper 1. Writing $W$ with the compressed dimension as its row index, so that $W \in \mathbb{R}^{d \times \text{out-dim}}$, form the Gram matrix $G = W W^\top \in \mathbb{R}^{d \times d}$, take its top-$k$ eigenvectors $U \in \mathbb{R}^{d \times k}$, and store the projected weight $W_\text{proj} = W^\top U \in \mathbb{R}^{\text{out-dim} \times k}$.
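The Gram-matrix construction can be sketched in a few lines of NumPy. This is an illustration on toy shapes, assuming $W$ is oriented with the compressed dimension $d$ as its rows; `gram_basis` and `relative_error` are hypothetical helper names, not functions from the runtime.

```python
import numpy as np

def gram_basis(W: np.ndarray, k: int):
    """W: (d, out_dim). Returns U: (d, k) and W_proj = W^T U: (out_dim, k)."""
    gram = W @ W.T                          # (d, d), symmetric positive semi-definite
    _, eigvecs = np.linalg.eigh(gram)       # eigh returns eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :k]             # top-k eigenvectors as orthonormal columns
    W_proj = W.T @ U
    return U, W_proj

def relative_error(W, U, W_proj):
    """Frobenius error of the rank-k reconstruction U @ W_proj^T, relative to W."""
    return np.linalg.norm(W - U @ W_proj.T) / np.linalg.norm(W)

rng = np.random.default_rng(0)
d, d_kv = 64, 16                            # toy analogue of d = 4096, d_kv = 1024
W_q = rng.standard_normal((d, d))           # square slot, full rank
W_k = rng.standard_normal((d, d_kv))        # GQA-style slot, rank at most d_kv

for name, W in [("Q", W_q), ("K", W_k)]:
    U, W_proj = gram_basis(W, k=d_kv)
    print(name, W_proj.shape, f"relative error {relative_error(W, U, W_proj):.3f}")
```

On the toy $K$ the reconstruction at $k = d_\text{kv}$ is exact up to floating-point error, which is the GQA "dimension cap" effect; the toy $Q$ loses energy at the same rank because its rank is not capped.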

$W_\text{down}$ is the one slot where this construction is unsatisfying. Its eigenspectrum on Llama-3.1-8B is approximately flat over the first $\sim$3,400 components (see Paper 1 §10, Figure 3), so truncating to $k = 1{,}024$ drops roughly half of its Frobenius energy on every layer we measured (§3.1). For this reason the production code path for $W_\text{down}$ uses a direct SVD rather than the eigendecomposition-of-Gram route, and the runtime exposes --axex-skip-o / --axex-attn-only flags that exclude $W_\text{down}$ (and $O$) entirely when the user judges the FFN penalty to be the dominant quality cost. Paper 1 runs in exactly this configuration.

3.1 Per-slot spectra on Llama-3.1-8B (measured)

Singular-value spectra for all seven slots, computed by full SVD on the dequantised Q4_K_M weights at layers $\{0, 7, 15, 23, 31\}$, see scripts/analysis/compute_spectra.py and the JSON output at docs/figures/spectra_summary.json. All values are layer-means; ranges across the five layers are given where they are informative.

Slot	Shape	Rank for 95% energy (layer-mean, range)	Rank for 99% (layer-mean)	Energy retained at $k=1{,}024$ (layer-mean)
$Q$	[4096, 4096]	1,682 (635–2,155)	2,434	0.836
$K$	[1024, 4096]	605 (253–724)	802	1.000 (GQA, dim cap)
$V$	[1024, 4096]	809 (783–835)	954	1.000 (GQA, dim cap)
$O$	[4096, 4096]	2,118 (1,947–2,342)	2,868	0.743
$W_\text{gate}$	[14336, 4096]	3,271 (3,199–3,304)	3,840	0.539
$W_\text{up}$	[14336, 4096]	3,360 (3,304–3,408)	3,873	0.492
$W_\text{down}$	[4096, 14336]	3,345 (3,293–3,407)	3,863	0.490

What this table actually says

The four attention slots are compressible. For $Q$ and $O$, 95% of the Frobenius energy lives in $\sim 1{,}700$–$2{,}100$ of $4{,}096$ singular directions ($\sim 41\%$–$52\%$ of full rank), and at $k = 1{,}024$ they retain $84\%$ and $74\%$ of their energy respectively. For $K$ and $V$, GQA already caps the rank at $1{,}024$, so their 95% ranks of 605 and 809 are measured against that smaller maximum and the $k = 1{,}024$ projection retains $100\%$.

The three FFN slots are not. 95% of their energy needs $\sim 3{,}300$ of $4{,}096$ directions ($\sim 80\%$ of full rank), and truncating to $k = 1{,}024$ drops $\sim 50\%$ of their Frobenius energy on every layer measured. This is the quantitative version of the qualitative claim in Paper 1 ("FFN is flat"): the FFN slots' rank-for-95%-energy is roughly $\mathbf{1.6\times}$ to $\mathbf{2\times}$ that of $Q$ and $O$, and their retained energy at $k = 1{,}024$ is $\mathbf{\sim 0.5}$ vs $\mathbf{\sim 0.8}$ for $Q$ and $O$. $W_\text{down}$ behaves like $W_\text{up}$ and $W_\text{gate}$ (mean rank 3,345 / energy retained 0.490). The runtime gives $W_\text{down}$ the dedicated SVD path not because its spectrum is qualitatively different from the other FFN slots, but because the Gram-matrix construction is numerically wasteful when the spectrum is approximately flat.

Caveat preserved: these spectra are weight-space, single model (Llama-3.1-8B), single quantisation (Q4_K_M dequantised). The cross-model activation-manifold evidence in §2 is independent of and does not extend to per-slot weight spectra on the other three models. We have not measured those.
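Both table quantities, "rank for 95% energy" and "energy retained at $k$", are functions of the singular values alone. The sketch below shows one way to compute them; it is not scripts/analysis/compute_spectra.py, and the two synthetic spectra are illustrative stand-ins rather than measured data. For a real weight matrix `W`, the input would be `np.linalg.svd(W, compute_uv=False)`.

```python
import numpy as np

def rank_for_energy(sigma: np.ndarray, fraction: float) -> int:
    """Smallest k whose top-k singular values hold >= fraction of the Frobenius energy."""
    sq = np.sort(sigma)[::-1] ** 2
    cumulative = np.cumsum(sq) / np.sum(sq)
    return min(int(np.searchsorted(cumulative, fraction)) + 1, sq.size)

def energy_retained(sigma: np.ndarray, k: int) -> float:
    """Fraction of the Frobenius energy kept by truncating to the top-k singular values."""
    sq = np.sort(sigma)[::-1] ** 2
    return float(sq[:k].sum() / sq.sum())

# Illustrative spectra over 4096 directions (assumptions, not measured weights).
flat = np.ones(4096)                            # perfectly flat: the worst case for truncation
decaying = 1.0 / np.sqrt(np.arange(1, 4097))    # slowly decaying spectrum

for name, sigma in [("flat", flat), ("decaying", decaying)]:
    print(name,
          "rank@95% =", rank_for_energy(sigma, 0.95),
          "rank@99% =", rank_for_energy(sigma, 0.99),
          f"retained@1024 = {energy_retained(sigma, 1024):.3f}")
```

A perfectly flat spectrum retains exactly $k / 4{,}096$ of its energy, i.e. 0.25 at $k = 1{,}024$. The measured FFN slots retain about 0.5, so they are flat relative to attention but not flat in this limiting sense.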

§4, Per-layer rank selection (MCR/Ricci)

Allocating the rank budget

A naive scheme assigns the same $k$ to every layer. Paper 1 uses this scheme. GP exposes a heuristic that allocates more rank to layers with steeper local geometry on the activation manifold. Concretely: at axiom-discovery time the pipeline emits a per-layer Manifold-Curvature-Ratio (MCR) scalar, derived from a coarse Ricci-style curvature estimate computed on the activation samples for that layer's input. Layers with high MCR are allocated rank closer to AXEX_MANIFOLD_K_MAX (currently $1{,}536$, see runtime/nn/axiom_exploit.h); layers with low MCR are allocated less. A hard floor prevents pathological collapse:

k_layer = clamp(round(k_target * mcr_layer / mcr_mean),
                k_floor,
                AXEX_MANIFOLD_K_MAX)

The $k_\text{floor}$ matters more than the upper cap. Without it, the heuristic aggressively starves shallow layers in small models, and we observed an immediate decode-quality collapse on SmolLM2-135M ($d = 576$) when $k$ for any attention layer fell below approximately $0.4 d$. With the floor at $k_\text{floor} = 0.55 d$ (≈ 320 for SmolLM2; ≈ 845 for Gemma-4-E2B; ≈ 1,690 for Phi-3.5-mini, which exceeds the cap and so falls back to it), the failure mode disappears. This is documented here as a contingency of the heuristic, not as a measurement: we have not run a quality sweep that pins down the precise floor curve.
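The clamp rule and the floor arithmetic above can be written out as a short sketch. The MCR values passed in below are hypothetical, chosen only to exercise the three branches; `allocate_rank` is an illustrative name, and the choice to let the cap win when the floor exceeds it follows the Phi-3.5-mini remark in the text.

```python
K_MAX = 1536  # AXEX_MANIFOLD_K_MAX as stated in the text

def allocate_rank(k_target: int, mcr_layer: float, mcr_mean: float,
                  k_floor: int, k_max: int = K_MAX) -> int:
    """Per-layer rank from the clamp rule; the cap wins if the floor exceeds it."""
    raw = round(k_target * mcr_layer / mcr_mean)
    floor = min(k_floor, k_max)
    return max(floor, min(raw, k_max))

# Floors at 0.55 * d for the three models in §2.
for name, d in [("SmolLM2-135M", 576), ("Gemma-4-E2B", 1536), ("Phi-3.5-mini", 3072)]:
    k_floor = round(0.55 * d)
    print(f"{name:13s} d={d:5d} k_floor={k_floor:5d} effective floor={min(k_floor, K_MAX)}")

# Hypothetical MCR ratios for a layer with k_target = 1024 and a floor of 845.
for ratio in (0.5, 1.0, 2.0):
    print(f"mcr/mean = {ratio}: k_layer = {allocate_rank(1024, ratio, 1.0, 845)}")
```

A layer at half the mean MCR would be assigned 512 by the raw rule and is lifted to the floor; a layer at twice the mean would be assigned 2,048 and is cut to the cap.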

Open knob

The MCR-driven per-layer rank assignment is the part of GP we are least confident about. It is enabled by default in the runtime, but Paper 1's headline numbers all use shared-rank ($k_\text{layer} = k$ for all $\ell$) because the MCR setting changes too many things at once for the calibration-free claim to remain clean. A future version of this paper will hold $k_\text{mean}$ fixed and compare shared-rank vs MCR-driven on PPL and decode tok/s on Llama-3.1-8B. That sweep does not yet exist.

§5, The geometry cache and the depth-sink shortcut

Turning a minutes-long startup into a seconds-long startup

Building the GP bases from scratch on Llama-3.1-8B takes roughly 70 seconds on the reference RTX 4070 Laptop. This is dominated by the eigendecompositions of 32 layers × 7 slots = 224 Gram matrices. On the 70B target the same pass is extrapolated at ~70 minutes (80 layers × 7 slots, larger $d$). To make GP practical the runtime persists two artefacts:

ott_geometry.bin: per-layer spectra, axioms, depth-sink index, intrinsic-dim estimate, weight-blob hash, integrity hash. Small (a few MB).
W_proj cache (one file per slot per layer): the projected weights themselves. Large; for Llama-3.1-8B at $k=1{,}536$ this is the 1,093 MB figure cited in Paper 1.

Both caches are keyed on a hash of (model file digest, AXEX flag set, target rank). Mismatches force a full rebuild rather than risking a stale basis.
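A cache key of this kind can be sketched as a hash over a canonical serialisation of the three inputs. The hash function (SHA-256), the serialisation, and the name `cache_key` are all assumptions for illustration; the paper does not describe the runtime's actual key format.

```python
import hashlib

def cache_key(model_digest: str, flags, target_rank: int) -> str:
    """Hypothetical key over (model file digest, AXEX flag set, target rank)."""
    # Sort the flags so that the key depends on the flag set, not on argument order.
    canonical = "\n".join([model_digest, " ".join(sorted(flags)), str(target_rank)])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

digest = "0" * 64  # placeholder model-file digest, not a real value
key_a = cache_key(digest, ["--axex-compress", "--axex-attn-only"], 1536)
key_b = cache_key(digest, ["--axex-attn-only", "--axex-compress"], 1536)
key_c = cache_key(digest, ["--axex-compress", "--axex-attn-only"], 1024)

print(key_a == key_b)  # same flag set in a different order: same key
print(key_a == key_c)  # different target rank: different key, forcing a rebuild
```

Keying on the flag set rather than the raw command line avoids spurious rebuilds when the same flags are passed in a different order.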

5.1 The depth-sink shortcut

Reloading the full geometry cache on a 70B model is not free, and at minimum we want to verify that the cache is not stale. The runtime's solution is the depth-sink layer: empirically, transformer activations saturate in effective dimensionality at a single layer roughly two-thirds of the way through the stack, and the spectrum at that one layer is enough to detect almost all weight-blob corruption. The cache integrity check therefore reads only the depth-sink layer's spectrum from ott_geometry.bin and recomputes it on the fly, which on Llama-3.1-8B takes under 200 ms.

The depth-sink is identified during the original axiom run as the layer where the cumulative-explained-variance curve flattens to within $10^{-3}$ of its plateau. On the four models we have inspected, this is layer 21/32 (SmolLM2), layer 19/26 (Gemma-4-E2B), layer 22/32 (Phi-3.5-mini), and layer 22/32 (Llama-3.1-8B). The two-thirds rule is a property of these four models, not a theorem; we flag it explicitly because we don't know yet whether it holds for substantially deeper architectures.
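One reading of the "within $10^{-3}$ of its plateau" criterion is sketched below. It assumes a single scalar per layer (for example, a cumulative-explained-variance value at a fixed $k$), takes the last layer's value as the plateau, and returns the earliest layer from which the curve stays inside the tolerance. The per-layer scalar, the plateau definition, and the toy curve are all assumptions; the runtime's actual criterion may differ.

```python
def depth_sink(curve, tol: float = 1e-3) -> int:
    """Earliest layer index from which the curve stays within tol of its final value."""
    plateau = curve[-1]
    sink = len(curve) - 1
    # Walk backwards from the last layer and stop at the first excursion beyond tol.
    for layer in range(len(curve) - 1, -1, -1):
        if abs(curve[layer] - plateau) > tol:
            break
        sink = layer
    return sink

# Toy six-layer curve (hypothetical values, not measured data).
toy_curve = [0.50, 0.80, 0.95, 0.9995, 0.9999, 1.0]
print("depth-sink layer:", depth_sink(toy_curve))
```

Scanning from the end rather than the start avoids misreporting an early layer that touches the plateau band briefly and then leaves it.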

§6, Empirical anchor: Llama-3.1-8B Q4_K_M

Numbers GP shares with Paper 1, and what changes when knobs move

The end-to-end numbers GP produces on its empirical anchor (Llama-3.1-8B-Instruct Q4_K_M, RTX 4070 Laptop, locked 30-second cooldown protocol) are the Paper 1 numbers by construction, since GP at attention-only-shared-rank is the Paper 1 configuration. We summarise them here with the GP framing rather than the calibration-free framing.

Configuration	Decode tok/s (% of baseline)	PPL (WikiText-2, 512 tok)	$W_\text{proj}$ disk
Baseline (no GP)	100.00%	6.7902	n/a
GP attn-only, shared $k = 1{,}024$	106.27%	10.9585 (+61.39%)	729 MB
GP attn-only, shared $k = 1{,}536$	97.55%	7.6936 (+13.30%)	1,093 MB
GP attn-only, $k = 2{,}048$ requested	101.04%	7.6936 (+13.30%)	1,093 MB

All four PPL values were measured under --ppl-eval on the first 512 tokens of WikiText-2; raw outputs are in docs/figures/ppl_sweep/llama31_8b_ppl_sweep.json (date 2026-04-22). Two structural observations: (i) $k=1{,}024$ collapses quality (+61% PPL). The K/V projection dimension on GQA Llama-3.1-8B is exactly $1{,}024$, so $K$ and $V$ sit at the lossless boundary, but $Q$ is truncated to a quarter of its $4{,}096$ dimensions and retains only 0.836 of its energy (§3.1). (ii) $k=1{,}536$ and $k=2{,}048$ produce identical PPL and an identical disk footprint because AXEX_MANIFOLD_K_MAX in runtime/nn/axiom_exploit.h silently caps the stored basis at $1{,}536$, so a request for $k=2{,}048$ runs the $k=1{,}536$ configuration. $K$ and $V$ are already full-rank (lossless) once $k \ge 1{,}024$, so within the cap additional rank affects only $Q$. Operationally: $k=1{,}536$ is the Pareto rank for this model on this protocol, requesting $k=2{,}048$ buys nothing over $k=1{,}536$, and $k=1{,}024$ is unsafe.
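The percentage columns and the effect of the rank cap can be re-derived from the table values. The sketch below only re-does the arithmetic; it does not run the benchmark, and `effective_rank` is an illustrative helper rather than a runtime function.

```python
K_MAX = 1536            # AXEX_MANIFOLD_K_MAX as stated in §4
BASELINE_PPL = 6.7902   # baseline row of the table

def effective_rank(requested: int) -> int:
    """Rank actually stored once the cap is applied."""
    return min(requested, K_MAX)

def ppl_increase(ppl: float) -> float:
    """Relative PPL increase over baseline, in percent."""
    return (ppl / BASELINE_PPL - 1.0) * 100.0

# (requested rank, measured PPL) pairs copied from the table.
rows = [(1024, 10.9585), (1536, 7.6936), (2048, 7.6936)]
for requested, ppl in rows:
    print(f"requested k={requested:5d} effective k={effective_rank(requested):5d} "
          f"PPL={ppl:.4f} (+{ppl_increase(ppl):.2f}%)")
```

The last two rows share an effective rank of 1,536, which is consistent with their identical PPL and disk footprint.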

6.1 What is not in this table

FFN-included configurations. The runtime supports them; the benchmark pass is gated on a 30-second-cooldown rerun and is not in the v0.2 pack.
MCR-driven per-layer rank vs shared rank. Same gating.
End-to-end numbers on SmolLM2-135M, Gemma-4-E2B, Phi-3.5-mini. These models loaded and ran under GP during the manifold-extraction pass; we observed qualitatively that decode remained coherent at attn-only $k = 0.6 d$ but did not produce paired-CI throughput or PPL numbers under the locked protocol. The honest claim is "GP runs without crashing on these three models", not "GP is measured on these three models."
Llama-3.1-70B. Compute is approved, the run is queued; no numbers yet.

§7, Where GP costs more than it saves

Failure modes and contraindications

7.1 Small $d$ (≤ 1024)

On SmolLM2-135M ($d = 576$) we observed text-quality collapse at any attention rank below roughly $0.4 d$. The mechanism we suspect: at small $d$, attention head dim $d_h = d/h$ is already small (≈ 64), so further compression eats into the per-head rank budget in ways that the per-block PCA does not preserve. The runtime defends against this with the $k_\text{floor}$ in §4; in practice this means GP at $k_\text{floor} = 0.55 d \approx 320$ on SmolLM2 is near-lossless and at $k = 200$ it is broken. There is no useful compression to be had on this model: the floor and the ceiling are too close together.

7.2 FFN-only configurations

The FFN-down spectrum is too flat for the eigendecomposition route. The runtime's SVD path makes the math work but does not change the underlying compressibility of the matrix. Empirically, FFN-only GP at $k = 1024$ on Llama-3.1-8B is not faster than baseline (it is, in the spot checks, slightly slower, since it adds a projection without saving enough bandwidth). We have not characterised this pattern with a full benchmark sweep and so we cite it here as observation rather than measurement.

7.3 The fp32 $W_\text{proj}$ format

The persistent $W_\text{proj}$ cache is currently stored as fp32. Converting it to Q8_0 would close most of the disk-footprint gap but would re-quantise weights that have already been quantised once (Q4_K_M dequantised -> fp32 -> Q8_0), and the behaviour at the second quantisation boundary is not characterised. This is the same concern Paper 1 §11 raises and applies identically here.

7.4 Cache invalidation cost

A weight-blob hash mismatch forces full re-axiom plus full $W_\text{proj}$ rebuild. On the 70B target this would cost ~70 minutes and ~14 GB of disk. Operationally this means model upgrades are an event, not a transparent rolling deploy.

§8, Reproduction

Commands and expected output

The Paper 1 reproduction recipe at repro/REPRODUCE.md reproduces the empirical-anchor row of §6. The cross-model intrinsic-dim numbers from §2 are reproducible via:

# intrinsic-dim re-run for any of the three models in legacy/axiom_vis/
.\build_host\geodessical.exe <model.gguf> --axex-axiom-only --axex-export-vis <outdir>
# then read <outdir>/phase1_manifold.json

Tolerance: intrinsic-dim estimate is stable to ±1 across reseeded runs of the Phase 1 sampler at $n_\text{samples} = 64$; consistency score is stable to ±0.02. The actual axiom set is sample-dependent (random projections); only the size and the consistency are stable.

§9, Status

What this paper is missing before it is finished

Multi-model end-to-end PPL and decode tok/s under the locked protocol (SmolLM2, Gemma-4-E2B, Phi-3.5-mini, plus 70B).
FFN-included sweep on Llama-3.1-8B at fixed total parameter budget.
MCR vs shared-rank A/B at matched mean-$k$.
Per-slot spectra on the other three reference models (SmolLM2-135M, Gemma-4-E2B, Phi-3.5-mini); only Llama-3.1-8B is measured today (see §3.1).
Cache-rebuild timing on 70B (currently extrapolated, not measured).

None of the above blocks the runtime from being usable today; they block the publication-grade version of the GP claim.

§9.5, Related work

Where Geodesic Projection sits in literature

Paper 1 §3.5 placed the single-slot $Q/K/V$ basis next to ASVD, SliceGPT, FWSVD, GPTQ, AWQ, and LoRA. Paper 2 inherits that placement for the attention slots and adds three more axes that do not appear in those works: multi-slot fitting across $Q/K/V/O$ plus FFN up/gate/down, persistent geometry caching, and a depth-sink shortcut that revalidates the cache from a single layer's spectrum (§5.1).

Activation low-dimensionality. The intrinsic-dim result in §2 is consistent with the isotropy/anisotropy line of work (Cai et al. 2021; Razzhigaev et al. 2023) and the linear-representation hypothesis (Park et al. 2023). Paper 2's contribution is operational rather than theoretical: it commits to a single PCA basis per slot, ships it to a runtime, and caches it across quantisation levels.
FFN compression. ASVD (Yuan et al. 2023) and FWSVD (Hsu et al. 2022) cover activation-aware FFN decomposition; the FFN-down SVD path in §3 is a deliberate reduction of those ideas to the smallest scheme that shares the geometry cache used by the attention slots, not a new contribution to FFN compression on its own.
Per-layer rank allocation. The MCR/Ricci heuristic in §4 sits next to compression-aware bit-budgeting work (GPTQ, AWQ, OmniQuant) but operates on rank rather than bits, on a fixed-precision integer budget, and against a curvature signal computed from the same Phase-1 cloud that produces the basis. The closest published precedent is Kimi Team's block-summary curvature signal in Block Attention Residuals (arXiv:2603.15031, 2026); the connection is made explicit in Paper 3 §4 and Paper 4 §3.
Geometry caching. The persistent $W_\text{proj}$ cache (1,093 MB on Llama-3.1-8B Q4_K_M, see §5) is closer in spirit to a compiled artefact (e.g. TensorRT plan caches, Triton kernel caches) than to existing compression work. We are not aware of a published compression pipeline that exposes the basis as a cache shared across quantisation levels for the same architecture; if there is one, this paragraph is the place we will cite it.
What we are not doing. We do not retrain (so this is not LoRA / DoRA territory), we do not change the attention algorithm (so this is not FlashAttention / Ring Attention territory), and we do not change the quantisation format (Q4_K_M is taken as a fixed input). Geodesic Projection is a geometry pass over the dequantised weights; the papers it interacts with most are Paper 1 (the empirical anchor), Paper 3 (speculative decoding on top), and Paper 6 (adaptive layer that turns the static rank in this paper into a phase-aware schedule).

§10, References

Selected refs

[1] Cai, X. et al., Isotropy in the Contextual Embedding Space: Clusters and Manifolds (ICLR 2021). Activation low-dimensionality on language models.
[2] Razzhigaev, A. et al., The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models (EACL 2023). TwoNN/PCA intrinsic-dim across architectures.
[3] Eckart, C. and Young, G., The Approximation of One Matrix by Another of Lower Rank, Psychometrika (1936). Optimality of truncated SVD.
[4] Hu, E. et al., LoRA: Low-Rank Adaptation of Large Language Models (ICLR 2022). Low-rank decomposition of weight updates.
[5] Williams, S. et al., Roofline: An Insightful Visual Performance Model, CACM (2009). Bandwidth/compute roofline.
[6] Grattafiori, A. et al., The Llama 3 Herd of Models (2024). Reference architecture.
[7] Park, K. et al., The Linear Representation Hypothesis (2023). Supports the manifold premise of §2.
[8] Kimi Team, Block Attention Residuals (arXiv:2603.15031, 2026). Cited for the AttnRes interaction discussed in Paper 3.