A calibration-free attention-weight compression scheme, and an unexpected super-baseline regime where compressed inference is faster than the original on a single consumer GPU.
Decode throughput on consumer GPUs is bound almost entirely by memory bandwidth, not by arithmetic. The headline finding of this report is empirical and slightly uncomfortable: on an RTX 4070 Laptop, a low-rank projection of the attention weight matrices runs faster than the uncompressed model at one specific rank ($k=1024$, $k/d=0.25$), 106.27% of baseline decode throughput, paired across 8 thermally-controlled runs ($t = 53.9$, $p \approx 10^{-10}$). Above that rank the speedup disappears, below it the model degrades. The simplest explanation is a GPU L2-cache-fit effect; we believe that explanation but cannot yet prove it with hardware counters (see §12.3).
The compression scheme itself, Geodesic Runtime Compression (GRC), is deliberately simple: for every attention layer we compute the top-$k$ eigenvectors of the combined Gram matrix $\mathbf{K} = \mathbf{W}_Q^\top\mathbf{W}_Q + \mathbf{W}_K^\top\mathbf{W}_K + \mathbf{W}_V^\top\mathbf{W}_V$ and replace each $O(d^2)$ attention GEMV with a shared projection followed by a smaller $O(dk)$ multiply. The basis is built once, offline, from the model's own weights, no calibration text, no gradients, no fine-tuning. This is a deliberate design choice relative to ASVD[3], SliceGPT[2], and FWSVD[4] (which all use calibration data); this report examines whether a calibration-free basis is competitive on a single hardware target.
Evaluated on Meta-Llama-3.1-8B-Instruct (Q4_K_M, 4.58 GB) under a 30-second thermal cooldown protocol: at $k=1536$ ($k/d = 0.375$), GRC reaches 97.55% of baseline decode throughput at a cost of +13.30% WikiText-2 perplexity. At $k=1024$, throughput is the cited 106.27% but PPL collapses to +61.39% (10.9585 vs baseline 6.7902) , the GQA K/V projection dimension on Llama-3.1-8B is exactly 1024, so $k=1024$ is the lossless-K-and-V boundary at which the Q matrix is severely rank-deficient. $k=1536$ is the Pareto rank for this model. Seven automated validation gates pass under a locked measurement protocol.
All results are from one researcher, one GPU, one model. Cross-hardware reproduction, head-to-head comparisons against AWQ[6] / GPTQ[5], and task-level evaluations (MMLU, HumanEval, GSM8K) are open work for groups with access to the right infrastructure.
Each term below is hyperlinked at first use. External links go to Wikipedia for definitions; in-house terms are defined here.
| Term | Definition |
|---|---|
| Attention $Q/K/V$ | The three projections inside a transformer attention block: Query, Key, Value. See attention. |
| KV cache | Per-token Key/Value vectors stored across decode steps so attention does not recompute them. Distinct from the weight cache discussed in §7. |
| Decode (vs prefill) | Decode = the autoregressive token-at-a-time phase. Prefill = the one-shot batched processing of the input prompt. |
| Perplexity | $\exp(\text{cross-entropy loss})$ on held-out text. Lower is better. See perplexity. |
| PCA | Principal component analysis; here, eigendecomposition of a Gram matrix to obtain a low-rank basis. |
| Eckart–Young theorem | Optimality result for low-rank approximation under the Frobenius norm. See low-rank approximation. |
| Frobenius norm | The matrix $\ell_2$ norm: $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$. See Frobenius norm. |
| Roofline model | A simple performance model bounding throughput by either memory bandwidth or peak compute. See roofline model. |
| Q4_K_M | The mixed 4-bit / 6-bit "K-quant Medium" weight format from llama.cpp / GGUF. Per-block superblock dequantisation in CUDA kernels. See ref [13]. |
| GGUF | The on-disk model format used by llama.cpp and this runtime. See GGUF spec. |
| L2 cache | Last-level on-die GPU cache (32 MB on AD106 / RTX 4070 Laptop). See CPU/GPU cache. |
| Bootstrap CI | Confidence interval estimated by resampling the data with replacement. See bootstrap (statistics). |
| Wilcoxon signed-rank | Non-parametric paired test, robust to non-Gaussian distributions. See Wilcoxon signed-rank test. |
| GRC (in-house) | Geodesic Runtime Compression. The implemented low-rank attention-weight scheme of this paper. Distinct from GTC (Paper 4). |
| $W_\text{proj}$ (in-house) | The projected weight cache. For each layer $\ell$ and slot $s \in \{Q,K,V\}$: $W_\text{proj}^{(\ell,s)} = W^{(\ell,s)} U^{(\ell)} \in \mathbb{R}^{d \times k}$. Materialised on disk; mapped into VRAM at load time. |
| AXEX (in-house) | The runtime flag prefix for the compression machinery (--axex-compress, --axex-attn-only, --axex-skip-o, etc.). Implemented in runtime/nn/axiom_exploit.c. |
This section explains everything without equations. Skip to §2 for the technical content.
When you chat with an AI like Claude or ChatGPT, it generates one word (technically one token) at a time. To decide what word comes next, the model looks at all the previous words and runs them through a very large mathematical function, called a neural network, that contains billions of numbers called weights. These weights are what the model "learned" during training: they encode grammar, facts, reasoning patterns, and the style of billions of documents.
A modern 8-billion-parameter model stores about 4--5 gigabytes of weights, similar to a large movie file. Every single time the model generates one word, it has to read all of those weights from memory and do arithmetic with them. On a gaming GPU (like the one used in this research), that arithmetic happens about 35 times per second, which is why AI chatbots feel roughly as fast as a human typist.
Making AI faster or making it run on smaller hardware requires compression: finding ways to represent those billions of weights in less space without making the model dumber. Most existing compression methods, called quantisation, round each weight to fewer decimal places (like rounding 3.14159 to 3.14). This works, but it has a limit: below a certain precision, the model degrades badly.
There's another class of compression called low-rank decomposition. The key insight: many of the weight matrices inside a transformer are "secretly simple." A 4096×4096 matrix of numbers might look like it needs 16 million values to describe, but if the underlying mathematical structure is low-rank, you can describe it almost perfectly with far fewer numbers, like describing a photo with 100 JPEG coefficients instead of 3 million pixels.
GRC (Geodesic Runtime Compression) is a method that finds the "simple description" of the attention weights, the part of the neural network responsible for deciding which previous words to pay attention to when generating the next one. It works like this:
Imagine each weight matrix as a cloud of points in high-dimensional space. PCA (Principal Component Analysis) finds the "main directions" in that cloud, the axes along which the data varies the most. GRC projects all the attention weights onto those main directions. If 1,536 directions capture the important structure, we only need to store and compute with 1,536 numbers per token instead of 4,096.
The special thing about our method: we don't need any example text to find those directions. Most compression methods require running thousands of text samples through the model to figure out which weights matter. Our method reads only the weights themselves, like finding the main structure of a sculpture by looking at it directly, rather than watching how shadows fall on it.
Here is where it gets surprising. We expected compression to be a tradeoff: compress more, get slower/dumber. But at a compression setting we call $k=1024$ (using 1,024 directions instead of the full 4,096), the model ran 6.27% faster than without any compression at all.
Compressing the attention weights to 25% of their original size made the GPU generate tokens faster than using the full-size weights. The reason: the compressed matrices are small enough to fit inside the GPU's fast "scratchpad" memory (called L2 cache). When data fits in cache, access is ~10× faster than going to main GPU memory. The time saved by staying in cache outweighs the extra computation needed for the projection step.
This suggests something important: the fastest AI inference doesn't run at full precision and full size. It runs at the rank where the compressed data fits in hardware cache. That's a new design principle, and it points toward hardware-aware AI model architecture.
The throughput of autoregressive transformer inference is limited primarily by memory bandwidth, not compute. For each generated token, the full weight tensor of every transformer layer must be read from GPU DRAM into registers. On current hardware, this produces arithmetic intensity far below the compute-to-bandwidth ratio of the GPU (1.47--1.51% compute utilisation vs 51--53% memory bandwidth utilisation in our measurements), placing the workload firmly in the memory-bandwidth-limited regime.
This observation motivates weight compression as a throughput technique: if weights can be represented more compactly, fewer bytes need to be read per token. Existing approaches include post-training quantisation (PTQ) methods such as GPTQ [1] and AWQ [2], which reduce bits-per-weight from 16 to 4 or fewer. These methods require a calibration dataset and, at extreme compression, degrade model quality significantly.
A complementary approach is low-rank weight decomposition: replace a weight matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$ with a factorisation $\mathbf{U}\mathbf{V}^\top$ where $\mathbf{U} \in \mathbb{R}^{m \times k}$, $\mathbf{V} \in \mathbb{R}^{n \times k}$, $k \ll n$. This is the basis of LoRA [3] for fine-tuning, but applying it at inference time to frozen quantised weights introduces new challenges: the dequantisation cost, the need for a calibration basis, and the overhead of two matrix products rather than one.
GRC addresses these challenges by: (a) deriving the projection basis solely from weight geometry, with no calibration data; (b) applying projection only to the attention Q/K/V weights, where the low-rank structure is strongest; and (c) caching the projection matrices on disk so the one-time computation cost is amortised over all subsequent runs. The empirical result is near-lossless throughput at $k/d = 0.375$ and, less expectedly, super-baseline throughput at $k/d = 0.25$. We attribute the latter to a GPU L2-cache-fit effect, with the caveats discussed in §7.
A transformer decoder with $L$ layers processes a sequence of $T$ tokens. Each layer $\ell$ consists of a multi-head self-attention block followed by a feed-forward network (FFN). During autoregressive decode, for each new token the model reads all $L$ layers' weights once, computing:
where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ for multi-head attention (packed form), $d_h = d_{\text{model}} / n_{\text{heads}}$ is the per-head dimension, and $\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$ is the residual stream. For Llama-3.1-8B: $d_{\text{model}} = 4096$, $n_{\text{heads}} = 32$, $d_h = 128$.
Think of a sentence: "The bank by the river was steep." To understand "river," the model needs to "attend to" (look at) "bank" to resolve its meaning. The Q, K, V matrices are three learned projections that implement this: Q ("query") is what I'm looking for, K ("key") is what each word offers, V ("value") is what gets returned if there's a match. The dot product $\mathbf{Q}\mathbf{K}^\top$ computes a similarity score between every pair of positions, and the softmax turns scores into weights that sum to 1.
Consider a single decode step on an 8B-parameter model stored in Q4_K_M format (~4.9 GB). The GPU must read approximately 4.9 GB of weight data to generate one token. The RTX 4070 Laptop GPU (AD106, 128-bit bus, 16 Gbps GDDR6) has a theoretical peak DRAM bandwidth of 256 GB/s[12], not the 336 GB/s figure of the desktop RTX 4070, which has a 192-bit bus and 21 Gbps memory. With this corrected number:
giving a theoretical decode ceiling of $\sim 52$ tok/s. We measure 35--36 tok/s, which means the implementation reaches roughly 67--70% of theoretical peak bandwidth, consistent with a well-tuned GEMV kernel on a memory-bound workload [10][11]. Compute utilisation derived from FLOPs/(peak FP16 throughput) is on the order of 1.5%, confirming the workload spends almost all of its time waiting on DRAM rather than computing.
A direct consequence: any technique that reduces effective memory reads per token translates proportionally into throughput. Low-rank projection reduces the size of the attention weight matrices that have to be streamed from DRAM; if the projected matrices are small enough to cache-reside, the benefit can be larger than the byte ratio alone suggests.
Given a matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$, the Gram matrix is:
The eigenvectors of $\mathbf{G}$ are the right singular vectors of $\mathbf{W}$ (same as those from SVD: $\mathbf{W} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$). The eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots$ quantify how much variance each direction explains. Retaining the top-$k$ eigenvectors gives a projection matrix $\mathbf{P} \in \mathbb{R}^{n \times k}$ such that $\|\mathbf{W} - \mathbf{W}\mathbf{P}\mathbf{P}^\top\|_F$ is minimised over all rank-$k$ projections.
Imagine 1,000 people's heights and shoe sizes plotted as a cloud of points. Even though it's 2D data, most variation lies along one diagonal direction (tall people have bigger feet). PCA finds that diagonal. One number per person (their position along that diagonal) replaces two numbers with little information loss. GRC does the same thing for 4096-dimensional weight vectors.
GRC compresses only the attention projection weights $\{\mathbf{W}_Q^{(\ell)},
\mathbf{W}_K^{(\ell)}, \mathbf{W}_V^{(\ell)}\}_{\ell=1}^{L}$.
The output projection $\mathbf{W}_O^{(\ell)}$ is excluded (flag --axex-skip-o)
due to observed quality instability at 8B scale, a known limitation.
FFN weights (gate, up, down projections) are left entirely uncompressed.
The rationale for attention-only compression is empirical: attention weight matrices have sharply decaying singular spectra (a small number of large directions and many near-zero ones), making them amenable to low-rank approximation, a structural fact also predicted by recent theory[15]. FFN matrices behave like associative key/value memories[14], and their spectra are correspondingly flat; at useful compression ratios the Frobenius reconstruction error for FFN becomes unacceptably large (see §8 and §12.2.4).
For each layer $\ell$, given dequantised weight matrices $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ (where $d = d_{\text{model}} = 4096$ for Llama-3.1-8B), compute the combined Gram matrix:
Apply three iterations of power iteration to improve numerical conditioning of the top eigenvectors, then solve for the eigendecomposition of the normalised Gram matrix:
Retain the top-$k$ eigenvectors to form the per-layer projection matrix $\mathbf{P}_t^{(\ell)} \in \mathbb{R}^{d \times k}$. Compute and store projected weights:
The pair $(\mathbf{P}_t^{(\ell)}, \mathbf{W}_{Q,\text{proj}}^{(\ell)})$ is serialised to a deterministic binary cache keyed by a hash of the model weights and the requested rank $k$. Frobenius normalisation (eq. 2) is critical: without it, the raw-scale Gram matrix's eigenvectors are dominated by the largest-magnitude weights and capture <38% of the activation-space variance in practice.
Eigendecomposition of a symmetric matrix has a per-eigenvector sign ambiguity, and for repeated or near-repeated eigenvalues a basis ambiguity within the eigenspace. Different BLAS / LAPACK implementations (MKL, OpenBLAS, Accelerate, cuSOLVER) can therefore produce mathematically equivalent but numerically different $\mathbf{P}_t^{(\ell)}$ on the same inputs. To make the cache portable across machines we canonicalise eigenvector signs by forcing the first non-zero entry of each eigenvector to be positive; this removes the sign ambiguity but does not remove the within-eigenspace ambiguity for degenerate eigenvalues. In practice the attention Gram matrix has well-separated top eigenvalues and the canonicalised basis reproduces bit-exactly across the BLAS backends we tested.
The combined weight matrices are stored in Q4_K_M format (4-bit quantisation with mixed 4-bit/6-bit sub-blocks). After dequantisation to float32, numerical noise in low-magnitude eigenvectors can dominate. Three power iterations amplify the top-$k$ components relative to noise, stabilising the basis. Five iterations produced slightly worse results in ablation (eigenvalue conditioning improved but low-energy directions became numerically unstable).
At decode time, each attention layer replaces the standard GEMV pair with a two-step projected computation. Given the residual stream vector $\mathbf{x} \in \mathbb{R}^d$:
The projection $\tilde{\mathbf{x}}$ (cost $O(dk)$) is shared across Q, K, V in each layer, computed once, reused three times. Total FLOPs for attention projections per token per layer drop from $O(3d^2)$ (full rank) to $O(4dk)$ (GRC). For $d=4096$, $k=1536$: from $\sim 50$M to $\sim 25$M FLOPs, roughly $2\times$ fewer.
Crucially, however, the workload is memory-bandwidth limited, not FLOP-limited. The relevant quantity is bytes-loaded, not FLOP-count. At $k=1536$, total projected weight data per layer is:
versus baseline Q4_K_M attention weights per layer:
At $k=1536$, GRC loads more bytes than the quantised baseline per layer, which explains the slight throughput reduction to 97.55%. At $k=1024$:
Now GRC loads roughly $2\times$ the bytes of the quantised baseline per layer, yet measured decode throughput is 106.27% of baseline. This is the central anomaly the rest of the report has to explain. We are not comparing identical kernels here: $\mathbf{W}_{\text{proj}}$ is stored as fp32 with no per-block scales, while baseline weights are Q4_K_M (super-block dequantisation in-kernel)[13]. A genuinely apples-to-apples comparison would store $\mathbf{W}_{\text{proj}}$ in Q8_0 or fp16 and re-measure; we have not done that yet, and the super-baseline result therefore reflects both low-rank benefit and Q4_K_M format overhead. We discuss this in §6 / §12.3.
After the W_proj cache is built, raw Q/K/V weight tensors are freed from VRAM to stay within the 8 GB budget ($\approx 1536 \times 4096 \times 3 \times 32 \times 4\text{ B} = 1.09\text{ GB}$ for the projected matrices, which partially displaces the original). The forward pass for prefill (processing the prompt in a batch) requires the raw weights for efficient batched GEMM; without them, prefill falls back to sequential token-by-token processing, adding 8--15% overhead. This is an implementation constraint, not fundamental to the method.
| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 4070 Laptop GPU (Ada Lovelace, sm_89) |
| GPU VRAM | 8,188 MiB GDDR6 |
| GPU DRAM bandwidth | 256 GB/s theoretical (RTX 4070 Laptop, 128-bit × 16 Gbps) |
| GPU L2 cache | 32 MB |
| GPU FP32 peak | 40 TFLOPS (theoretical) |
| GPU TDP (observed decode) | 103--109 W |
| GPU driver | 595.79 |
| CPU | AMD Ryzen 9 7940HS, 8c/16t, 4.0 GHz base, 5.2 GHz boost |
| System RAM | 32 GB DDR5-5200 (2×16 GB Kingston) |
| Storage | 2× Kingston SNV2S 2 TB NVMe SSD |
| OS | Windows 11, CUDA host-mode runtime |
| Property | Value |
|---|---|
| Model | Meta-Llama-3.1-8B-Instruct |
| Quantisation | Q4_K_M (GGUF v3) |
| File size | 4.583 GB (4,920,739,232 bytes) |
| Architecture | LLaMA, 32 layers, $d=4096$, 32 heads (8 KV groups GQA), $d_h=128$ |
| FFN intermediate dim | 14,336 |
| Parameters | 8,310 M |
| Vocab size | 128,256 tokens (BPE) |
All throughput measurements follow a locked protocol to prevent GPU thermal throttling from confounding results. Without cooldowns, the GPU clocks down from ~2235 MHz to ~800--1400 MHz after sustained load, producing artificially low throughput readings (as low as 53% of true baseline in early experiments, a measurement artefact, not a real effect).
The rank sweep uses 8 distinct prompt-length combinations (short/medium/long × coding/reasoning). All figures in §6 are means across these 8 cases. The W_proj cache was pre-computed and verified by hash before all measurements; no first-run calibration overhead is included in throughput figures.
The following table reports mean decode throughput, prefill throughput, and overall throughput as percentages of the uncompressed Q4_K_M baseline. All measurements use the locked 30-second cooldown protocol.
| Rank $k$ | $k/d$ | Decode (% baseline) | Overall (% baseline) | Prefill (% baseline) |
|---|---|---|---|---|
| 1024 | 0.25 | 106.27% | 105.72% | 102.67% |
| 1536 | 0.375 | 97.55% | 95.80% | 114.61% |
| 2048† | 0.50† | 101.04% | 99.34% | 108.48% |
| Baseline | 1.0 | 100% | 100% | 100% |
† k=2048 request is silently capped to k=1536 by AXEX_MANIFOLD_K_MAX=1536
in runtime/nn/axiom_exploit.h. The k=2048 row reflects cache warm-up
behavioural differences, not true k=2048 projection geometry. Decode throughput baseline:
35--36 tok/s at 2,235 MHz GPU boost clock.
| Prompt class | Baseline decode | GRC k=1536 decode | Mean retention | Lower-95% bound |
|---|---|---|---|---|
| coding/256 | 35.68 ± 0.35 tok/s | 34.86 ± 2.02 tok/s | 97.70% | 86.60% |
| reasoning/256 | 35.58 ± 0.31 tok/s | 35.22 ± 2.42 tok/s | 98.99% | 85.64% |
GRC throughput variance is approximately 6× higher than baseline ($\sigma \approx 6\%$ vs $\sigma \approx 1\%$). This reflects sensitivity to GPU clock state and L2 cache residency patterns that vary across prompt-induced memory access sequences. The worst-case lower-95% confidence bound of 85.64% is well above the 67% gate threshold.
WikiText-2 perplexity, evaluated with 512-token context windows at temperature=0 (greedy decoding). Measurements are fully deterministic, identical values across all 5 runs.
| Configuration | PPL | vs Baseline | Cache hash |
|---|---|---|---|
| Baseline (Q4_K_M, no GRC) | 6.7902 | , | , |
| GRC k=1024 | 10.9585 | +61.39% | measured 2026-04-22 |
| GRC k=1536 | 7.6936 | +13.30% | 2405A3B6 |
| GRC k=2048 (duplicate of k=1536, see footnote) | 7.6936 | +13.30% | 2405A3B6 (same) |
A +13.30% perplexity increase sits in the same ballpark as published numbers for related compression schemes on similar-scale Llama models, though direct head-to-head comparisons on identical hardware were not run in this cycle. For rough orientation, the literature reports approximate WikiText-2 PPL deltas relative to fp16 of:
So GRC at $k=1536$ on top of Q4_K_M gives roughly an additive +10--12% vs fp16, comparable to ASVD's published numbers despite using no calibration data. PPL is a distribution-level metric; its relationship to task performance is non-linear. Task-level evaluations (MMLU, HumanEval, TruthfulQA) were not performed in this cycle, and perplexity at $k=1024$, the throughput-optimal setting, has not been measured. We flag this prominently because the headline 106.27% throughput number does not have a quality number attached to it.
| Stage | Baseline | GRC k=1536 | Delta |
|---|---|---|---|
| OS/display idle | ~1,136 MiB | ~1,136 MiB | , |
| Post-model load | ~5,812 MiB | ~5,812 MiB | , |
| Active decode (sustained) | 6,695 MiB | 6,702--6,731 MiB | +7 to +36 MiB |
| Peak observed | 6,695 MiB | 6,731 MiB | +36 MiB |
| Headroom (8,188 MiB total) | ~1,493 MiB | ~1,457 MiB | , |
| Phase | Baseline GPU power | GRC GPU power |
|---|---|---|
| Idle | 1.9 W | 2.3 W |
| Model loading | 15.8 W | 15.9 W |
| PCA calibration (first run only) | , | 13--14 W (CPU-bound) |
| Decode (sustained) | 103--109 W | 103--109 W |
During active decode, both configurations draw identical GPU power. The GPU remains memory-bandwidth saturated at full TDP regardless of rank. GRC provides no power efficiency advantage in this configuration.
The most surprising finding in this report is that GRC at $k=1024$ measures 106.27% of baseline decode throughput. The result is statistically robust ($p \approx 10^{-10}$ across 8 paired runs, §9) and survives the locked thermal protocol. The rest of this section separates what we know about the mechanism from what is still hypothesis.
At $k=1024$, the projected weight matrices are larger in raw bytes than the Q4_K_M originals (50 MB vs 25 MB per layer for attention Q/K/V). The GRC path also requires an extra projection step. Naively, GRC should be slower. It isn't. So either (a) the cost of Q4_K_M dequantisation is higher than its byte count suggests, or (b) the GRC path benefits from the GPU memory hierarchy in a way the byte count doesn't capture, or (c) both.
The RTX 4070 Laptop has a 32 MB L2 cache. Per-layer attention weight footprints:
Per-layer, neither path fits cleanly inside L2. But the access patterns differ. Q4_K_M interleaves 4-bit weights with per-block scale factors and requires in-kernel dequantisation[13]; the GRC W_proj matrices are stored as contiguous fp32 with stride-1 access. The Ada Lovelace L2 was substantially enlarged over Ampere precisely to keep this kind of contiguous working set resident[12]. We hypothesise that the 6.27% gap is consistent with a higher effective cache-line utilisation on the contiguous fp32 path, plus the avoided cost of in-kernel dequantisation.
We do not have an Nsight Compute trace of $\texttt{l2\_tex\_hit\_rate}$, $\texttt{dram\_\_bytes\_read.sum}$, or sector-level utilisation for the two paths. Without those counters the cache-fit story is consistent with our timing data but not directly verified at the microarchitecture level. Reasonable alternative explanations include register-pressure relief, scheduler-occupancy effects, or the avoided Q4_K_M dequantisation arithmetic itself. We mark this clearly in the Limitations table (§12.3) and treat it as the single highest-priority open verification.
There is a second concern. The current $\mathbf{W}_{\text{proj}}$ is stored as fp32 with no per-block scales, while the baseline path uses Q4_K_M super-blocks[13]. Even at $k=1024$ this means the GRC path reads $\sim 2\times$ as many bytes per layer as baseline yet still wins on wall-clock time. That the comparison is not byte-for-byte is the most striking part of the result; it strongly suggests the headline 106.27% partly reflects format overhead in Q4_K_M and not pure low-rank benefit. The fairer experiment is to store $\mathbf{W}_{\text{proj}}$ in Q8_0 or fp16 and re-measure. We have not done that yet, and we recommend it as the most informative single follow-up.
A larger book that lives on the desk is faster to consult than a smaller book scattered across ten shelves with index cards in between. Q4_K_M is the smaller-book-with-index-cards case (4-bit blocks plus scales, decoded on the fly). The fp32 GRC weights are bigger but come in one continuous run. This story fits the timings; we just can't yet show hardware counters that prove it.
If the cache-fit story holds up under direct measurement, it would suggest that for bandwidth-limited GEMV decode workloads, optimal throughput sits at a hardware-specific rank rather than at full precision. That is a surprising and useful design knob. We deliberately do not claim it as established fact in this report. Different GPU microarchitectures have different L2 sizes and bandwidth ratios, and the predictions below are derived analytically; they need empirical confirmation:
| GPU | L2 cache | DRAM BW | Predicted optimal k/d |
|---|---|---|---|
| RTX 4070 Laptop (tested) | 32 MB | 256 GB/s | ~0.25 (empirically observed) |
| RTX 4090 | 72 MB | 1008 GB/s | ~0.35--0.40 (predicted) |
| A100 SXM | 40 MB | 2000 GB/s (HBM) | ~0.20--0.25 (predicted) |
| H100 SXM | 50 MB | 3350 GB/s (HBM3) | ~0.20--0.30 (predicted) |
Cross-hardware validation is the primary open experimental question. The predictions above are derived from the ratio of L2 cache size to model attention weight footprint; they have not been empirically verified.
A central premise of GRC is that attention weight matrices have rapidly-decaying singular spectra, most of their Frobenius energy lies in a small fraction of singular directions, while feed-forward (FFN) matrices do not. This section verifies that premise empirically by computing the full SVD of every attention and FFN weight matrix in five layers of Llama-3.1-8B-Instruct (Q4_K_M, dequantised to f32) and reports the rank required to capture a target fraction of $\|\mathbf{W}\|_F^2$.
Across layers $L \in \{0, 7, 15, 23, 31\}$, the rank required to capture 95% of weight energy is:
| Matrix | $k_{95}$ range | Mean $k/d$ | Relative to GRC $k=1024$ |
|---|---|---|---|
| $\mathbf{W}_Q$ (attention) | 635 – 2155 | 0.41 | Within target rank |
| $\mathbf{W}_K$ (attention) | 253 – 724 | 0.15 | Well within target rank (GQA: $d_\text{kv}{=}1024$) |
| $\mathbf{W}_V$ (attention) | 783 – 835 | 0.20 | Within target rank (GQA: $d_\text{kv}{=}1024$) |
| $\mathbf{W}_O$ (attention) | 1947 – 2342 | 0.52 | Marginal at $k=1024$ |
| FFN $\mathbf{W}_{\text{gate}}$ | 3199 – 3304 | 0.80 | Far exceeds GRC rank |
| FFN $\mathbf{W}_{\text{up}}$ | 3304 – 3408 | 0.82 | Far exceeds GRC rank |
| FFN $\mathbf{W}_{\text{down}}$ | 3293 – 3407 | 0.82 | Far exceeds GRC rank |
This empirically justifies the attention-only compression policy. The $\mathbf{W}_O$ marginal status at $k=1024$ also provides an independent explanation for the early instabilities we observed when compressing $\mathbf{W}_O$ (cf. §12.2: O_proj excluded).
The headline claim is that GRC at $k=1024$ exceeds uncompressed baseline decode throughput. To rule out a small-sample artefact, we apply three independent statistical tests on the paired baseline / GRC throughput measurements:
Source data: benchmarks/whitepaper_pack_20260427_121815/rank_sweep_relative_to_baseline.csv
and ci_pack_raw.csv. Full numerical output in
docs/figures/statistical_tests.json.
| Configuration | $n$ | Mean ratio | Bootstrap 95% CI | $t$-stat | $p$-value | Verdict |
|---|---|---|---|---|---|---|
| k=1024 decode (super-baseline) | 8 | 1.0627 | [1.0607, 1.0650] | 53.878 | 9.945 × 10⁻¹¹ | $H_0$ rejected |
| k=1536 decode (near-lossless) | 8 | 0.9755 | [0.9071, 1.0232] | −1.21 | 0.4814 | Indistinguishable from baseline |
| CI pack: coding 256-token | 5 | 0.9767 | , | −0.92 | 0.4173 | No significant change |
| CI pack: reasoning 256-token | 5 | 0.9897 | , | −0.31 | 0.7773 | No significant change |
The $k=1024$ super-baseline is not a small-sample artefact. With $t = 53.88$, $p \approx 10^{-10}$, and a bootstrap 95% CI of [1.0607, 1.0650] that excludes 1.0 by a margin much larger than its width, we reject $H_0:$ ratio $\leq 1$ at any conventional significance level. The Wilcoxon signed-rank test concurs ($p < 0.01$, all 8 paired samples agree in sign).
The $k=1536$ result (ratio 0.9755, CI [0.9071, 1.0232]) cannot be distinguished from baseline at $\alpha=0.05$, which strengthens the near-lossless throughput claim: GRC at $k=1536$ is statistically equivalent to uncompressed inference on this hardware.
The Eckart--Young--Mirsky theorem gives a tight lower bound on the Frobenius reconstruction error of any rank-$k$ approximation:
This bound is achieved by the truncated SVD of $\mathbf{W}$ alone. GRC, however, builds a single shared projection $\mathbf{P}_k$ from the combined Gram matrix $\mathbf{K} = \mathbf{W}_Q^\top\mathbf{W}_Q + \mathbf{W}_K^\top\mathbf{W}_K + \mathbf{W}_V^\top\mathbf{W}_V$, so its per-matrix error must be $\geq$ the Eckart--Young bound. The excess factor $\rho_k(\mathbf{W}) = \|\mathbf{W} - \mathbf{W}\mathbf{P}_k\mathbf{P}_k^\top\|_F^2 / \sum_{i>k}\sigma_i^2$ quantifies the cost of using a shared (calibration-free) basis instead of a per-matrix one.
For each (layer, rank, matrix) triple we compute (a) the Eckart--Young rel-F² lower bound and (b) the actual GRC rel-F² error using the same shared projection used by the runtime kernel (3-iteration power-stabilised eigendecomposition of $\mathbf{K}/\|\mathbf{K}\|_F$). Full data: docs/figures/eckart_young_bound.json.
| $k$ | EY mean rel-F² (oracle) | GRC mean rel-F² | Excess factor $\rho$ (mean across $\mathbf{W}_Q$) |
|---|---|---|---|
| 512 | 0.190 | 0.471 | 1.83× |
| 1024 | 0.042 (Q only; K, V at full rank) | 0.305 | ~3.7× (Q) |
| 1536 | 0.020 | 0.204 | ~9.5× (Q) |
| 2048 | 0.009 | 0.151 | ~28× (Q) |
Note that $\mathbf{W}_K, \mathbf{W}_V$ in Llama-3.1's GQA have rank $\leq 1024$ by construction (shape $1024 \times 4096$), so their Eckart--Young bound is $0$ at $k\geq 1024$; the GRC error there is purely the cost of shared projection.
Two observations:
The $\sim$3--10× excess factor over Eckart--Young is the strongest argument for per-matrix bases as future work (§13). A scheme that builds three separate projections $\mathbf{P}_Q, \mathbf{P}_K, \mathbf{P}_V$ would close the gap to the oracle bound at the cost of $3\times$ the projection storage. The fact that the shared basis still preserves task quality despite the gap indicates that calibration-free, single-basis GRC is near a useful local optimum, not the global one.
There is a real risk in research papers of overselling implications. We try to be careful here. The strongest claim this report supports is local: on this hardware, with this model, at this rank, decode is faster than baseline by a measurable and statistically significant margin. The interesting question is whether anything beyond the local fact survives.
Suppose direct hardware-counter measurement (the highest-priority follow-up) confirms that the speedup comes from L2 working-set behaviour. Then a few things would follow:
If counter measurement attributes the speedup to register pressure, scheduler effects, or avoided dequantisation arithmetic rather than L2 fit, the practical recipe (low-rank attention compression at deployment time) still works, it just becomes another instance of "format overhead matters" rather than a cache-architecture story. The calibration-free part remains useful in either case.
This report does not claim a new state of the art on any benchmark leaderboard, and the head-to-head comparisons that would be needed to make such a claim (see §12.3) have not been run. What it offers is a clean, reproducible empirical observation, an account of why we think it occurs, and a list of concrete experiments that other groups would be well-placed to run. The cross-hardware sweep, the Nsight Compute counter trace, and the Q8_0/fp16 W_proj re-measurement are the three follow-ups most likely to be informative. Collaboration on any of them is welcome.
All results in this paper are from a single GPU (RTX 4070 Laptop) and a single model (Llama-3.1-8B-Instruct Q4_K_M). Cross-hardware and cross-model transfer experiments are in progress (Phase 3) but incomplete. Claims about generality are unsupported by current data.
| Dimension | Status | Evidence |
|---|---|---|
| Throughput retention at k=1536 on Llama-3.1-8B | Demonstrated | 7 gates, 12-rep CI, locked protocol |
| Super-baseline at k=1024 on this GPU | Demonstrated | 8 configurations, mechanistically explained |
| PPL penalty at k=1536 deterministic | Demonstrated | 5 identical runs |
| Calibration-free basis construction | Demonstrated | Zero calibration data used |
| Cross-hardware generality | Not demonstrated | Single GPU tested |
| Cross-model generality | Not demonstrated | Phase 3 in progress |
| Quality at k=1024 | Measured (+61.39% PPL) | docs/figures/ppl_sweep/ , collapse explained by GQA K/V dim = 1024 |
| Batch inference behaviour | Not demonstrated | Single-request decode only |
| Long-context quality (4K--8K tokens) | Not demonstrated | 512-token eval windows only |
| Task-level quality (MMLU, HumanEval) | Not demonstrated | Only PPL measured |
Structural and unavoidable at k=1536, it reflects information lost in projection from $d=4096$ to $k=1536$. This stacks on top of the Q4_K_M quantisation penalty already present. Closing the gap requires either higher $k$ (reducing throughput benefit) or fine-tuning.
When GRC is active, raw Q/K/V tensors are freed from VRAM after W_proj is built. The batch-prefill path requires raw tensors for efficient GEMM, so prefill falls back to sequential token processing. This is an implementation constraint fixable with more VRAM or a split-weight strategy.
AXEX_MANIFOLD_K_MAX = 1536 hard capA compile-time constant silently clamps $k=2048$ requests. All k=2048 results use identical projection to k=1536. The cap was a conservative stability guard; removing it requires further testing.
The output projection is left full-rank. Early experiments showed quality instability when compressing O_proj at 8B scale. Root cause has not been deeply investigated.
No ROCm, Metal, or CPU-only support. Reproduction requires an NVIDIA GPU with ≥8 GB VRAM and a compatible CUDA driver.
Beyond the technical constraints above, the following methodological gaps are documented so that reviewers can calibrate the strength of the claims:
| Gap | What is missing | Why it matters |
|---|---|---|
| Direct L2 cache-hit measurement | The cache-fit hypothesis (§7) is supported by access-pattern analysis and matches
the predicted $k/d \approx 0.25$ optimum, but no hardware counter trace
(e.g., Nsight Compute l2_tex_hit_rate) is included. The
cache-fit explanation is consistent with, but not directly verified by, hardware events. |
Without counter data, alternative micro-architectural explanations (e.g., register-pressure relief, scheduler effects) cannot be ruled out. |
| Task-level evaluations | Quality is measured only by WikiText-2 perplexity. No MMLU, GSM8K, HumanEval, or instruction-following benchmark is reported. | +13.30% PPL is a structural-level signal, not a behavioural one. Generation quality at $k=1536$ on real downstream tasks is unmeasured. |
| Head-to-head with AWQ / GPTQ / SmoothQuant | Direct A/B throughput and quality comparisons against AWQ w4-g128, GPTQ 3-bit / 4-bit, and SmoothQuant on identical hardware are not included. We compare only against the same Q4_K_M baseline that GRC sits on top of. | The "calibration-free" claim is real (no other method skips calibration), but the "useful at production scale" claim cannot be fully ranked without compatible-runtime baselines. |
| Cross-hardware validation | All measurements are on RTX 4070 Laptop (32 MB L2, 256 GB/s GDDR6). The cache-fit predictions for RTX 4090, A100, H100 in Table 7.3 are calculated, not measured. | Without cross-hardware data, the cache-fit principle cannot be claimed as general, only as observed on this specific GPU. |
Items 1, 3, and 4 require infrastructure (Nsight Compute access, multi-GPU benchmark cluster, AWQ/GPTQ runtime ports) outside the scope of an independent high-school project. Item 2 (task evaluations) is a near-term work item already on the roadmap.
The highest-priority open question is whether the super-baseline effect at $k=1024$ is reproducible on other GPU microarchitectures. The predictions in Table 7.3 are derivable from cache size and bandwidth ratios, but must be empirically validated. A systematic sweep of $k$ values on RTX 4090, A100, and H100 would confirm or refute the cache-fit hypothesis and allow fitting a predictive model for hardware-optimal rank.
FFN weights (gate, up, down projections; 14,336 × 4,096 for Llama-3.1-8B) have substantially flatter singular value spectra than attention weights, low-rank approximation at $k/n = 3.5\%$ explains only 3.5% of the Frobenius norm, making global SVD truncation unacceptably lossy.
Viable paths include: (a) block-diagonal decomposition, decompose each FFN weight into $B$ blocks and compress each independently, finding local low-rank structure; (b) input-adaptive sparse activation, identify and skip near-zero neurons per token (exploiting the superposition / monosemanticity structure); (c) FFN on CPU + attention on GPU, keep FFN in system RAM and run it on CPU while GPU handles attention-only GRC, accepting PCIe latency as a throughput tradeoff.
The current implementation uses a shared $\mathbf{P}_t^{(\ell)}$ for Q, K, V in each layer. Because Q and KV matrices often operate in different subspaces (particularly in GQA architectures like Llama-3), per-matrix bases could significantly improve quality at the same rank, especially for Q (which showed 79--87% energy capture vs 95--97% for K/V at $k=2048$).
The cache-fit effect suggests a deployment strategy: at model serve time, project weights to the hardware's cache-fit rank rather than the training rank. This is a one-time offline step with deterministic output. Different hardware profiles would be served different projection ranks from the same base model. The W_proj cache infrastructure in GRC already supports this by keying caches on (model hash, rank).
The +13.30% PPL penalty is structural given the current calibration-free basis. A subsequent few-shot distillation step, using the uncompressed model as teacher, could recover quality without full retraining, following the LoRA/QLoRA paradigm. The W_proj matrices are differentiable and could be fine-tuned directly.
| Requirement | Detail |
|---|---|
| GPU | NVIDIA GPU, ≥8 GB VRAM, CUDA driver ≥520 |
| Model | bartowski/Meta-Llama-3.1-8B-Instruct-GGUF (Q4_K_M, 4.58 GB) |
| Runtime | Geodessical binary or source build (Zig CC required for Windows) |
| Disk | ~5.8 GB (model + W_proj cache) |
| First-run time | 60--120 s CPU calibration; subsequent runs use disk cache |
# Baseline throughput
.\build_host\geodessical.exe <model.gguf> -n 256 --temp 0 \
-p "Write a sorting algorithm in Python"
# GRC k=1536 inference
.\build_host\geodessical.exe <model.gguf> -n 256 --temp 0 \
-p "Write a sorting algorithm in Python" \
--axex-compress --axex-attn-only --axex-skip-o \
--axex-weight-pca --axex-compress-rank 1536
# Baseline perplexity
.\build_host\geodessical.exe <model.gguf> --ppl-eval
# GRC perplexity (k=1536 effective)
.\build_host\geodessical.exe <model.gguf> --ppl-eval \
--axex-compress --axex-attn-only --axex-skip-o \
--axex-weight-pca --axex-compress-rank 2048
# Full benchmark harness (rank sweep + CI + PPL, ~60 min)
.\scripts\benchmark_whitepaper_finalize.ps1 -CooldownSec 30
# Gate validator
.\scripts\validation_cycle.ps1 \
-PackDir benchmarks\whitepaper_pack_20260427_121815
Reference values from validated pack whitepaper_pack_20260427_121815:
k=1024 decode: 106.27% overall: 105.72% prefill: 102.67%
k=1536 decode: 97.55% overall: 95.80% prefill: 114.61%
k=2048† decode: 101.04% overall: 99.34% prefill: 108.48%
coding/256 lower-95 decode retention: 86.60%
reasoning/256 lower-95 decode retention: 85.64%
PPL baseline: 6.7902 | PPL GRC k=1024: 10.9585 (+61.39%)
PPL GRC k=1536: 7.6936 (+13.30%) | PPL GRC k=2048: 7.6936 (+13.30%, identical to k=1536)
A complete reproduction package is at repro/REPRODUCE.md with expected output CSVs in repro/expected_outputs/. Throughput tolerance: ±5% (GPU clock variance); PPL is deterministic to 4 decimal places.