Paper 1 isolates one design choice and reports an end-to-end measurement on
one model. This paper describes the full compression pipeline that the
geodessical runtime implements and the
cross-architecture intrinsic-dimensionality evidence that motivates it. Two things
this paper does not do, and the reader should treat them as scope limits:
- It does not provide multi-model, end-to-end perplexity and throughput sweeps. The empirical anchor remains the Llama-3.1-8B-Instruct Q4_K_M number set from Paper 1.
- It does not cover the 70B model. The runtime supports it; the measurement pass is queued for EC2 (compute approved) and the pack is not yet in the repository. Numbers will appear in v0.3.
The phrase used throughout to describe what is not measured is design‑validated: the code path exists, runs without error, and is consistent with the implemented behaviour of the components it composes, but no end-to-end benchmark has produced numbers for it.
Abstract
Geodesic Projection (GP) is the multi-slot, per-layer attention and FFN
compression scheme implemented in the geodessical C11 runtime.
It extends the calibration-free attention compression of Paper 1 along
three axes: per-matrix slot coverage ($Q$, $K$, $V$, $O$, FFN up,
FFN gate, FFN down); per-layer rank selection driven by a
manifold-curvature heuristic with a hard floor; and a dedicated
SVD-based path for FFN down, whose
singular-value profile is fundamentally less compressible than
the others. Two engineering pieces make GP usable in practice:
(a) a persistent geometry cache that drops startup from
minutes to seconds on repeat runs, and (b) a depth-sink
shortcut that lets the cache be probed without re-running the full
axiom-discovery pass. The cross-architecture observation that motivates the
whole pipeline, that the local activation manifold of a trained transformer
is roughly $11--25$-dimensional regardless of the ambient $d \in \{576, 1536, 3072\}$, reproduces on three open-weight models (SmolLM2-135M, Gemma-4-E2B,
Phi-3.5-mini) using the manifold-extraction pipeline checked in under
legacy/axiom_vis/.
Terms
| Term | Definition |
|---|---|
| GP (in-house) | Geodesic Projection. The full compression pipeline of this paper. Generalises the attention-only scheme of Paper 1 to all seven weight matrices in a transformer block. |
| Slot (in-house) | One of the seven compressible weight matrices per transformer block: $Q$, $K$, $V$, $O$ (attention) and $W_\text{up}$, $W_\text{gate}$, $W_\text{down}$ (FFN). Each gets its own basis $U^{(\ell, s)}$. |
| $U^{(\ell, s)}$ | The per-layer-per-slot orthonormal basis. Built from the dominant eigenvectors of $W^{(\ell, s)} (W^{(\ell, s)})^\top$ (or the right singular vectors for FFN down). |
| Intrinsic dimension | The dimensionality of the lowest-dimensional manifold on which the model's activations approximately lie. Estimated here by retaining the smallest $k$ such that PCA on a set of activation samples explains $\geq 95\%$ of the variance. See refs [2, 7]. |
| Geometry cache (in-house) | The on-disk artefact (ott_geometry.bin) holding the manifold-extraction output: per-layer eigenspectra, axiom set, depth-sink layer, intrinsic-dim estimate, and an integrity hash. |
| $W_\text{proj}$ cache | The on-disk artefact holding the projected weights $W^{(\ell, s)} U^{(\ell, s)} \in \mathbb{R}^{d \times k}$. Mapped into VRAM at load time. Distinct from the geometry cache, which holds bases not projected weights. |
| Depth-sink (in-house) | The single layer at which the residual-stream effective dimensionality saturates. Its eigenspectrum dominates the cache validity check, so the cache can be revalidated by reading the geometry of one layer instead of all of them. |
| MCR / Ricci heuristic (in-house) | Manifold-Curvature-Ratio: a per-layer scalar derived from the local Ricci-style curvature of the activation manifold. Used to allocate larger ranks to layers with steeper local geometry. Capped at $k = \texttt{AXEX_MANIFOLD_K_MAX}$ above and at a per-model floor below. |
| FFN down | The matrix $W_\text{down} \in \mathbb{R}^{d \times d_\text{ffn}}$ that contracts the FFN intermediate space ($d_\text{ffn} = 14{,}336$ for Llama-3.1-8B) back to model dim $d = 4096$. Its singular-value spectrum is much flatter than the attention slots, so it gets a dedicated SVD path rather than the eigendecomposition route. |
| AXEX (in-house) | The runtime flag prefix for the GP machinery. --axex-compress, --axex-attn-only, --axex-skip-o, --axex-weight-pca, --axex-compress-rank N, --axex-kv. Defined in runtime/nn/axiom_exploit.h. |
What Paper 1 measures, what GP generalises
Paper 1 presents a deliberately narrow design: shared rank $k$ across layers, attention slots only ($Q$, $K$, $V$; $O$ skipped), no FFN compression, no quality fine-tune. The narrowness is the point, it lets the calibration-free claim and the super-baseline claim be tested without confounding from other compression knobs.
The runtime implements a more general scheme. The shape of the generalisation is summarised below. Each row is a knob in the production pipeline that Paper 1 holds fixed.
| Knob | Paper 1 setting | GP setting |
|---|---|---|
| Slots compressed | $Q$, $K$, $V$ only | $Q$, $K$, $V$, $O$, FFN up, FFN gate, FFN down (configurable) |
| Rank $k$ across layers | Shared | Per-layer (MCR/Ricci-driven, with a $k$ floor) |
| Decomposition | Eigendecomposition of $W W^\top$ | Same for $Q/K/V/O$ and FFN up/gate; SVD for FFN down |
| Build cost | Paid every run | Paid once; cached in ott_geometry.bin + W_proj |
| Cache validation | n/a | Depth-sink-layer eigenspectrum hash + weight-blob hash |
| KV-cache projection | Off | Optional (--axex-kv) |
Paper 1's measurement uses the GP code path with all but the first knob held in the simplest position. The numbers reported there therefore are GP numbers in the limit where GP collapses to attention-only fixed-rank. What GP buys above and beyond is (a) coverage of the FFN, which is where the bulk of Llama-3.1-8B's parameters live (FFN ≈ 70% of weights in this architecture), and (b) the cache architecture that makes the build cost a one-time charge instead of a per-run charge.
The honest position on FFN compression is that the runtime can compress it but that Paper 1 deliberately did not because the FFN's singular-value spectrum on Llama-3.1-8B is dramatically flatter than the attention slots'. We present the spectrum evidence in §3 and the consequence for compression quality in §7.
Why a low-rank basis exists at all
The premise of GP is that the local activation manifold of a trained transformer has intrinsic dimension much smaller than the ambient model dim $d$. If that premise is false, low-rank weight projection should be uniformly destructive. The manifold-extraction pipeline checked in under legacy/axiom_vis/ runs four phases on a model, manifold sampling, symmetry detection, curvature estimation, and axiom-set extraction, and emits per-phase JSON. We summarise the Phase 1 (intrinsic-dimensionality) and Phase 4 (axiom-set) outputs below for three open-weight models.
| Model | $d$ | Intrinsic dim (95% var) | $k_\text{int}/d$ | Samples | Axiom set size | Consistency |
|---|---|---|---|---|---|---|
| SmolLM2-135M | 576 | 17 | 2.95% | 64 | 24 / 96 | 0.921 |
| Gemma-4-E2B | 1,536 | 25 | 1.63% | 64 | 23 / 92 | 0.961 |
| Phi-3.5-mini | 3,072 | 11 | 0.36% | 64 | 22 / 96 | 0.959 |
Source data:
phase1_manifold.json
and
phase4_axioms.json
for each model under legacy/axiom_vis/<model>/.
"Consistency" is the Phase 4 self-consistency score: fraction of the discovered
axiom set that survives held-out rebuild on a disjoint sample.
What it is: three independent reproductions of the "low-intrinsic-dimensionality of LLM activations" finding [refs 1, 2, 7] on models that span an order of magnitude in $d$ (576 -> 3,072) and three architectures (Llama-style decoder, Gemma-style hybrid, Phi-style decoder). All three converge on $k_\text{int} \in \{11, 17, 25\}$, that is, $k_\text{int} \ll d$ uniformly, and $k_\text{int}$ does not grow with $d$.
What it is not: a guarantee that weight-space PCA of the individual $Q/K/V/O$ matrices captures this same intrinsic dimension. Activation-space intrinsic dim is necessary for the premise but not sufficient for the GP construction. The actual weight-space evidence (singular-value spectra of all seven slots on Llama-3.1-8B) is in §3: it strongly supports the GP construction for the four attention slots and quantitatively contradicts it for the three FFN slots, which is exactly the asymmetry Paper 1 leans on.
Why FFN-down gets a different code path
The seven compressible matrices in a Llama-style transformer block, with shapes for Llama-3.1-8B ($d = 4{,}096$, $d_\text{ffn} = 14{,}336$, $h = 32$ heads):
Attention Q : [d, d] = [4096, 4096]
K, V : [d_kv, d] = [1024, 4096] (GQA: 8 KV heads x 128)
O : [d, d] = [4096, 4096]
FFN W_up : [d_ffn, d] = [14336, 4096]
W_gate : [d_ffn, d] = [14336, 4096]
W_down : [d, d_ffn] = [4096, 14336]
Note: Llama-3.1-8B uses grouped-query attention, so $K$ and $V$ have only $d_\text{kv} = 1{,}024$ output rows (not $d = 4{,}096$). This is one reason rank $k = 1{,}024$ is a particularly natural Pareto knee: on $K$ and $V$ the projection becomes lossless (you cannot have more singular values than the smaller matrix dimension), so the only quality cost at $k = 1{,}024$ comes from $Q$ and (when included) $O$.
The attention slots, plus $W_\text{up}$ and $W_\text{gate}$, are tall in the same direction (fewer rows than columns is allowed too; what matters is that the "compressed" dimension is the one the basis is built over). For these slots GP uses the same construction as Paper 1: form the Gram matrix $K = W W^\top \in \mathbb{R}^{d \times d}$, take its top-$k$ eigenvectors $U \in \mathbb{R}^{d \times k}$, and store the projected weight $W_\text{proj} = W^\top U \in \mathbb{R}^{\text{out-dim} \times k}$.
$W_\text{down}$ is the one slot where this construction is unsatisfying. Its
eigenspectrum on Llama-3.1-8B is approximately flat over the first
$\sim$3,400 components, see Paper 1 §10, Figure 3, meaning that
truncating to $k = 1024$ drops more than 30% of its Frobenius energy on every layer
we measured. For this reason the production code path for $W_\text{down}$ uses a
direct
SVD rather than the eigendecomposition-of-Gram
route, and the runtime exposes --axex-skip-o /
--axex-attn-only flags that exclude $W_\text{down}$ (and $O$) entirely
when the user judges the FFN penalty to be the dominant quality cost. Paper 1
runs in exactly this configuration.
3.1 Per-slot spectra on Llama-3.1-8B (measured)
Singular-value spectra for all seven slots, computed by full SVD on the dequantised Q4_K_M weights at layers $\{0, 7, 15, 23, 31\}$, see scripts/analysis/compute_spectra.py and the JSON output at docs/figures/spectra_summary.json. All values are layer-means; ranges across the five layers are given where they are informative.
| Slot | Shape | Rank for 95% energy (layer-mean, range) |
Rank for 99% (layer-mean) |
Energy retained at $k=1{,}024$ (layer-mean) |
|---|---|---|---|---|
| $Q$ | [4096, 4096] | 1,682 (635–2,155) | 2,434 | 0.836 |
| $K$ | [1024, 4096] | 605 (253–724) | 802 | 1.000 (GQA, dim cap) |
| $V$ | [1024, 4096] | 809 (783–835) | 954 | 1.000 (GQA, dim cap) |
| $O$ | [4096, 4096] | 2,118 (1,947–2,342) | 2,868 | 0.743 |
| $W_\text{gate}$ | [14336, 4096] | 3,271 (3,199–3,304) | 3,840 | 0.539 |
| $W_\text{up}$ | [14336, 4096] | 3,360 (3,304–3,408) | 3,873 | 0.492 |
| $W_\text{down}$ | [4096, 14336] | 3,345 (3,293–3,407) | 3,863 | 0.490 |
The four attention slots are highly compressible: 95% of their Frobenius energy lives in $\sim 1{,}600$–$2{,}100$ of $4{,}096$ singular directions ($\sim 41\%$–$52\%$ of full rank), and at the production setting $k = 1{,}024$ they retain $74\%$–$84\%$ of their energy ($Q$, $O$) or $100\%$ ($K$, $V$, where GQA already caps the rank at $1{,}024$).
The three FFN slots are not. 95% of their energy needs $\sim 3{,}300$ of $4{,}096$ ($\sim 80\%$ of full rank), and truncating to $k = 1{,}024$ drops $\sim 50\%$ of their Frobenius energy on every layer measured. This is the quantitative version of the qualitative claim in Paper 1 ("FFN is flat"): the FFN slots' rank-for-95%-energy is roughly $\mathbf{2\times}$ that of the attention slots, and their retained energy at $k = 1{,}024$ is $\mathbf{\sim 0.5}$ vs $\mathbf{\sim 0.8}$ for attention. $W_\text{down}$ behaves like $W_\text{up}$ and $W_\text{gate}$ (mean rank 3,345 / energy retained 0.490), which is why the runtime gives $W_\text{down}$ the dedicated SVD path: not because its spectrum is qualitatively different from the other FFN slots, but because the Gram-matrix construction is numerically wasteful when the spectrum is approximately flat.
Caveat preserved: these spectra are weight-space, single model (Llama-3.1-8B), single quantisation (Q4_K_M dequantised). The cross-model activation-manifold evidence in §2 is independent of and does not extend to per-slot weight spectra on the other three models. We have not measured those.
Allocating the rank budget
A naive scheme assigns the same $k$ to every layer. Paper 1 uses this scheme.
GP exposes a heuristic that allocates more rank to layers with steeper local
geometry on the activation manifold. Concretely: at axiom-discovery time the
pipeline emits a per-layer Manifold-Curvature-Ratio (MCR) scalar, derived from a
coarse Ricci-style curvature estimate computed on the activation samples for that
layer's input. Layers with high MCR are allocated rank closer to
AXEX_MANIFOLD_K_MAX (currently $1{,}536$, see runtime/nn/axiom_exploit.h);
layers with low MCR are allocated less. A hard floor prevents pathological collapse:
k_layer = clamp(round(k_target * mcr_layer / mcr_mean),
k_floor,
AXEX_MANIFOLD_K_MAX)
The $k_\text{floor}$ matters more than the upper cap. Without it, the heuristic aggressively starves shallow layers in small models, and we observed an immediate decode-quality collapse on SmolLM2-135M ($d = 576$) when $k$ for any attention layer fell below approximately $0.4 d$. With the floor at $k_\text{floor} = 0.55 d$ (≈ 320 for SmolLM2; ≈ 845 for Gemma-4-E2B; ≈ 1,690 for Phi-3.5-mini, which exceeds the cap and so falls back to it), the failure mode disappears. This is documented here as a contingency of the heuristic, not as a measurement: we have not run a quality sweep that pins down the precise floor curve.
The MCR-driven per-layer rank assignment is the part of GP we are least confident about. It is enabled by default in the runtime, but Paper 1's headline numbers all use shared-rank ($k_\text{layer} = k$ for all $\ell$) because the MCR setting changes too many things at once for the calibration-free claim to remain clean. A future version of this paper will hold $k_\text{mean}$ fixed and compare shared-rank vs MCR-driven on PPL and decode tok/s on Llama-3.1-8B. That sweep does not yet exist.
Turning a minutes-long startup into a seconds-long startup
Building the GP bases from scratch on Llama-3.1-8B takes roughly 70 seconds on the reference RTX 4070 Laptop. This is dominated by the eigendecompositions of 32 layers × 7 slots = 224 Gram matrices. On the 70B target the same pass is extrapolated at ~70 minutes (80 layers × 7 slots, larger $d$). To make GP practical the runtime persists two artefacts:
ott_geometry.bin, per-layer spectra, axioms, depth-sink index, intrinsic-dim estimate, weight-blob hash, integrity hash. Small (a few MB).W_projcache (one file per slot per layer), the projected weights themselves. Large; for Llama-3.1-8B at $k=1{,}536$ this is the 1,093 MB figure cited in Paper 1.
Both caches are keyed on a hash of (model file digest, AXEX flag set, target rank). Mismatches force a full rebuild rather than risking a stale basis.
5.1 The depth-sink shortcut
Reloading the full geometry cache on a 70B model is not free, at minimum we want
to verify the cache is not stale. The runtime's solution is the depth-sink
layer: empirically, transformer activations saturate in effective dimensionality at
a single layer roughly two-thirds of the way through the stack, and the spectrum at
that one layer is enough to detect almost all weight-blob corruption. The cache
integrity check therefore reads only the depth-sink layer's spectrum from
ott_geometry.bin and recomputes it on the fly, which on Llama-3.1-8B
takes under 200 ms.
The depth-sink is identified during the original axiom run as the layer where the cumulative-explained-variance curve flattens to within $10^{-3}$ of its plateau. On the four models we have inspected, this is layer 21/32 (SmolLM2), layer 19/26 (Gemma-4-E2B), layer 22/32 (Phi-3.5-mini), and layer 22/32 (Llama-3.1-8B). The two-thirds rule is a property of these four models, not a theorem; we flag it explicitly because we don't know yet whether it holds for substantially deeper architectures.
Numbers GP shares with Paper 1, and what changes when knobs move
The end-to-end numbers GP produces on its empirical anchor, Llama-3.1-8B-Instruct Q4_K_M, RTX 4070 Laptop, locked 30-second cooldown protocol, are the Paper 1 numbers by construction, since GP at attention-only-shared-rank is the Paper 1 configuration. We summarise them here with the GP framing rather than the calibration-free framing.
| Configuration | Decode tok/s (% of baseline) | PPL (WikiText-2, 512 tok) | $W_\text{proj}$ disk |
|---|---|---|---|
| Baseline (no GP) | 100.00% | 6.7902 | , |
| GP attn-only, shared $k = 1{,}024$ | 106.27% | 10.9585 (+61.39%) | 729 MB |
| GP attn-only, shared $k = 1{,}536$ | 97.55% | 7.6936 (+13.30%) | 1,093 MB |
| GP attn-only, $k = 2{,}048$ requested | 101.04% | 7.6936 (+13.30%) | 1,093 MB |
All four PPL values measured under --ppl-eval on WikiText-2 first
512 tokens; raw outputs in
docs/figures/ppl_sweep/llama31_8b_ppl_sweep.json
(date 2026-04-22). Two structural observations: (i) $k=1{,}024$ collapses
quality (+61% PPL) because the K/V projection dimension on GQA-Llama-3.1-8B
is exactly $1{,}024$ , at $k=1{,}024$ the Q matrix is rank-deficient
against its full $4{,}096$-dim, while K and V are at the boundary; (ii)
$k=1{,}536$ and $k=2{,}048$ produce identical PPL because once
$k \ge 1{,}024$ the K and V matrices are full-rank (lossless), so
additional rank only affects Q, whose PCA energy is already saturated by
$k=1{,}536$. The $k=2{,}048$ disk footprint matches $k=1{,}536$ because
AXEX_MANIFOLD_K_MAX in
runtime/nn/axiom_exploit.h
silently caps the stored basis. Operationally: $k=1{,}536$ is the Pareto
rank for this model on this protocol , $k=2{,}048$ is wasted memory
and compute, $k=1{,}024$ is unsafe.
6.1 What is not in this table
- FFN-included configurations. The runtime supports them; the benchmark pass is gated on a 30-second-cooldown rerun and is not in the v0.2 pack.
- MCR-driven per-layer rank vs shared rank. Same gating.
- End-to-end numbers on SmolLM2-135M, Gemma-4-E2B, Phi-3.5-mini. These models loaded and ran under GP during the manifold-extraction pass; we observed qualitatively that decode remained coherent at attn-only $k = 0.6 d$ but did not produce paired-CI throughput or PPL numbers under the locked protocol. The honest claim is "GP runs without crashing on these four models", not "GP is measured on these four models."
- Llama-3.1-70B. Compute is approved, the run is queued; no numbers yet.
Failure modes and contraindications
7.1 Small $d$ (≤ 1024)
On SmolLM2-135M ($d = 576$) we observed text-quality collapse at any attention rank below roughly $0.4 d$. The mechanism we suspect: at small $d$, attention head dim $d_h = d/h$ is already small (≈ 64), so further compression eats into the per-head rank budget in ways that the per-block PCA does not preserve. The runtime defends against this with the $k_\text{floor}$ in §4; in practice this means GP at $k_\text{floor} = 0.55 d \approx 320$ on SmolLM2 is near-lossless and at $k = 200$ it is broken. There is no useful compression to be had on this model, the floor and the ceiling are too close together.
7.2 FFN-only configurations
The FFN-down spectrum is too flat for the eigendecomposition route. The runtime's SVD path makes the math work but does not change the underlying compressibility of the matrix. Empirically, FFN-only GP at $k = 1024$ on Llama-3.1-8B is not faster than baseline (it is, in the spot checks, slightly slower, since it adds a projection without saving enough bandwidth). We have not characterised this pattern with a full benchmark sweep and so we cite it here as observation rather than measurement.
7.3 The fp32 $W_\text{proj}$ format
The persistent $W_\text{proj}$ cache is currently stored as fp32. Converting it to Q8_0 would close most of the disk-footprint gap but would re-quantise weights that have already been quantised once (Q4_K_M dequantised -> fp32 -> Q8_0), and the behaviour at the second quantisation boundary is not characterised. This is the same concern Paper 1 §11 raises and applies identically here.
7.4 Cache invalidation cost
A weight-blob hash mismatch forces full re-axiom plus full $W_\text{proj}$ rebuild. On the 70B target this would cost ~70 minutes and ~14 GB of disk. Operationally this means model upgrades are an event, not a transparent rolling deploy.
Commands and expected output
The Paper 1 reproduction recipe at repro/REPRODUCE.md reproduces the empirical-anchor row of §6. The cross-model intrinsic-dim numbers from §2 are reproducible via:
# intrinsic-dim re-run for any of the three models in legacy/axiom_vis/
.\build_host\geodessical.exe <model.gguf> \
--axex-axiom-only --axex-export-vis <outdir>
# then read <outdir>/phase1_manifold.json
Tolerance: intrinsic-dim estimate is stable to ±1 across reseeded runs of the Phase 1 sampler at $n_\text{samples} = 64$; consistency score is stable to ±0.02. The actual axiom set is sample-dependent (random projections); only the size and the consistency are stable.
What this paper is missing before it is finished
- Multi-model end-to-end PPL and decode tok/s under the locked protocol (SmolLM2, Gemma-4-E2B, Phi-3.5-mini, plus 70B).
- FFN-included sweep on Llama-3.1-8B at fixed total parameter budget.
- MCR vs shared-rank A/B at matched mean-$k$.
- Per-slot spectra on the other three reference models (SmolLM2-135M, Gemma-4-E2B, Phi-3.5-mini); only Llama-3.1-8B is measured today (see §3.1).
- Cache-rebuild timing on 70B (currently extrapolated, not measured).
None of the above blocks the runtime from being usable today; they block the publication-grade version of the GP claim.
Selected refs
- Cai, T. et al., Isotropy in the Contextual Embedding Space (ACL 2021), activation low-dimensionality on language models.
- Razzhigaev, A. et al., The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models (EACL 2023), TwoNN/PCA intrinsic-dim across architectures.
- Eckart, C. and Young, G., The Approximation of One Matrix by Another of Lower Rank, Psychometrika (1936), optimality of truncated SVD.
- Hu, E. et al., LoRA: Low-Rank Adaptation of Large Language Models (ICLR 2022), low-rank decomposition of weight updates.
- Williams, S. et al., Roofline: An Insightful Visual Performance Model, CACM (2009), bandwidth/compute roofline.
- Touvron, H. et al., Llama 3 Herd of Models (2024), reference architecture.
- Park, K. et al., The Linear Representation Hypothesis (2023), supports the manifold premise of §2.
- Kimi Team, Block Attention Residuals (arXiv:2603.15031, 2026), cited for the AttnRes interaction discussed in Paper 3.