HyperTensor / GRC, NagusameCS Research

§0, Abstract

Abstract

Decode throughput on consumer GPUs is bound almost entirely by memory bandwidth, not by arithmetic. The headline finding of this report is empirical and slightly uncomfortable: on an RTX 4070 Laptop, a low-rank projection of the attention weight matrices runs faster than the uncompressed model at one specific rank ($k=1024$, $k/d=0.25$), reaching 106.27% of baseline decode throughput, paired across 8 thermally-controlled runs ($t = 53.9$, $p \approx 10^{-10}$). Above that rank the speedup disappears; below it the model degrades. The simplest explanation is a GPU L2-cache-fit effect; we believe that explanation but cannot yet prove it with hardware counters (see §12.3).

The compression scheme itself, Geodesic Runtime Compression (GRC), is deliberately simple: for every attention layer we compute the top-$k$ eigenvectors of the combined Gram matrix $\mathbf{K} = \mathbf{W}_Q^\top\mathbf{W}_Q + \mathbf{W}_K^\top\mathbf{W}_K + \mathbf{W}_V^\top\mathbf{W}_V$ and replace each $O(d^2)$ attention GEMV with a shared projection followed by a smaller $O(dk)$ multiply. The basis is built once, offline, from the model's own weights: no calibration text, no gradients, no fine-tuning. This is a deliberate design choice relative to ASVD^[3], SliceGPT^[2], and FWSVD^[4] (which all use calibration data); this report examines whether a calibration-free basis is competitive on a single hardware target.

Evaluated on Meta-Llama-3.1-8B-Instruct (Q4_K_M, 4.58 GiB) under a 30-second thermal cooldown protocol: at $k=1536$ ($k/d = 0.375$), GRC reaches 97.55% of baseline decode throughput at a cost of +13.30% WikiText-2 perplexity. At $k=1024$, throughput is the cited 106.27% but PPL degrades by +61.39% (10.9585 vs baseline 6.7902). The GQA K/V projection dimension on Llama-3.1-8B is exactly 1024, so $k=1024$ is the lossless-K-and-V boundary at which the Q matrix is severely rank-deficient. Among the ranks tested, $k=1536$ gives the best throughput/quality trade-off for this model. Seven automated validation gates pass under a locked measurement protocol.

All results are from one researcher, one GPU, one model. Cross-hardware reproduction, head-to-head comparisons against AWQ^[6] / GPTQ^[5], and task-level evaluations (MMLU, HumanEval, GSM8K) are open work for groups with access to the right infrastructure.

§0.5, Glossary

Terms used in this report

Each term below is hyperlinked at first use. External links go to Wikipedia for definitions; in-house terms are defined here.

Term	Definition
Attention $Q/K/V$	The three projections inside a transformer attention block: Query, Key, Value. See attention.
KV cache	Per-token Key/Value vectors stored across decode steps so attention does not recompute them. Distinct from the weight cache discussed in §7.
Decode (vs prefill)	Decode = the autoregressive token-at-a-time phase. Prefill = the one-shot batched processing of the input prompt.
Perplexity	$\exp(\text{cross-entropy loss})$ on held-out text. Lower is better. See perplexity.
PCA	Principal component analysis; here, eigendecomposition of a Gram matrix to obtain a low-rank basis.
Eckart–Young theorem	Optimality result for low-rank approximation under the Frobenius norm. See low-rank approximation.
Frobenius norm	The entrywise $\ell_2$ norm of a matrix: $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$. See Frobenius norm.
Roofline model	A simple performance model bounding throughput by either memory bandwidth or peak compute. See roofline model.
Q4_K_M	The mixed 4-bit / 6-bit "K-quant Medium" weight format from llama.cpp / GGUF. Per-block superblock dequantisation in CUDA kernels. See ref [13].
GGUF	The on-disk model format used by llama.cpp and this runtime. See GGUF spec.
L2 cache	Last-level on-die GPU cache (32 MB on AD106 / RTX 4070 Laptop). See CPU/GPU cache.
Bootstrap CI	Confidence interval estimated by resampling the data with replacement. See bootstrap (statistics).
Wilcoxon signed-rank	Non-parametric paired test, robust to non-Gaussian distributions. See Wilcoxon signed-rank test.
GRC (in-house)	Geodesic Runtime Compression. The implemented low-rank attention-weight scheme of this paper. Distinct from GTC (Paper 4).
$W_\text{proj}$ (in-house)	The projected weight cache. For each layer $\ell$ and slot $s \in \{Q,K,V\}$: $W_\text{proj}^{(\ell,s)} = W^{(\ell,s)} P_t^{(\ell)} \in \mathbb{R}^{d \times k}$, where $P_t^{(\ell)}$ is the per-layer basis of §4.2. Materialised on disk; mapped into VRAM at load time.
AXEX (in-house)	The runtime flag prefix for the compression machinery (`--axex-compress`, `--axex-attn-only`, `--axex-skip-o`, etc.). Implemented in runtime/nn/axiom_exploit.c.

§1, Plain summary

The Simple Version: What We Built and Why It Matters

If you're new to AI, start here

This section explains everything without equations. Skip to §2 for the technical content.

What is a large language model?

When you chat with an AI like Claude or ChatGPT, it generates one word (technically one token) at a time. To decide what word comes next, the model looks at all the previous words and runs them through a very large mathematical function, called a neural network, that contains billions of numbers called weights. These weights are what the model "learned" during training: they encode grammar, facts, reasoning patterns, and the style of billions of documents.

A modern 8-billion-parameter model stores about 4--5 gigabytes of weights, similar to a large movie file. Every single time the model generates one word, it has to read all of those weights from memory and do arithmetic with them. On a gaming GPU (like the one used in this research), that arithmetic happens about 35 times per second, which is why AI chatbots feel roughly as fast as a human typist.

What is the problem?

Making AI faster or making it run on smaller hardware requires compression: finding ways to represent those billions of weights in less space without making the model dumber. Most existing compression methods, called quantisation, round each weight to fewer decimal places (like rounding 3.14159 to 3.14). This works, but it has a limit: below a certain precision, the model degrades badly.

There's another class of compression called low-rank decomposition. The key insight: many of the weight matrices inside a transformer are "secretly simple." A 4096×4096 matrix of numbers might look like it needs 16 million values to describe, but if the underlying mathematical structure is low-rank, you can describe it almost perfectly with far fewer numbers, like describing a photo with 100 JPEG coefficients instead of 3 million pixels.

What did we build?

GRC (Geodesic Runtime Compression) is a method that finds the "simple description" of the attention weights, the part of the neural network responsible for deciding which previous words to pay attention to when generating the next one. It works like this:

The intuition

Imagine each weight matrix as a cloud of points in high-dimensional space. PCA (Principal Component Analysis) finds the "main directions" in that cloud, the axes along which the data varies the most. GRC projects all the attention weights onto those main directions. If 1,536 directions capture the important structure, we only need to store and compute with 1,536 numbers per token instead of 4,096.

The special thing about our method: we don't need any example text to find those directions. Most compression methods require running thousands of text samples through the model to figure out which weights matter. Our method reads only the weights themselves, like finding the main structure of a sculpture by looking at it directly, rather than watching how shadows fall on it.

What did we discover?

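Here is where it gets surprising. We expected compression to be a tradeoff: compress more, and the model gets slower or dumber. But at a compression setting we call $k=1024$ (using 1,024 directions instead of the full 4,096), the model ran 6.27% faster than without any compression at all. The catch is quality: at that setting the model's perplexity (a measure of how badly it predicts text) rises by about 61%, so the practical setting is $k=1536$, which keeps about 97.5% of the speed for a 13% perplexity cost.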

The key finding

Projecting the attention weights onto 25% of their original rank made the GPU generate tokens faster than using the full-size weights. Our working explanation is that the projected matrices are laid out in a way that makes better use of the GPU's fast on-chip memory (called L2 cache), and that they skip the on-the-fly decoding step the 4-bit baseline needs. The time saved appears to outweigh the extra computation needed for the projection step. We have not yet confirmed this with hardware counters (see §7).

If that explanation holds, it suggests something important: the fastest AI inference may not run at full precision and full size, but at a rank matched to the hardware's cache. That would be a useful design principle, and it points toward hardware-aware AI model architecture.

§2, Introduction

Introduction

The throughput of autoregressive transformer inference is limited primarily by memory bandwidth, not compute. For each generated token, the full weight tensor of every transformer layer must be read from GPU DRAM into registers. On current hardware, this produces arithmetic intensity far below the compute-to-bandwidth ratio of the GPU (1.47--1.51% compute utilisation vs roughly 67--70% memory bandwidth utilisation in our measurements, see §3.2), placing the workload firmly in the memory-bandwidth-limited regime.

This observation motivates weight compression as a throughput technique: if weights can be represented more compactly, fewer bytes need to be read per token. Existing approaches include post-training quantisation (PTQ) methods such as GPTQ^[5] and AWQ^[6], which reduce bits-per-weight from 16 to 4 or fewer. These methods require a calibration dataset and, at extreme compression, degrade model quality significantly.

A complementary approach is low-rank weight decomposition: replace a weight matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$ with a factorisation $\mathbf{U}\mathbf{V}^\top$ where $\mathbf{U} \in \mathbb{R}^{m \times k}$, $\mathbf{V} \in \mathbb{R}^{n \times k}$, $k \ll n$. This is the basis of LoRA [3] for fine-tuning, but applying it at inference time to frozen quantised weights introduces new challenges: the dequantisation cost, the need for a calibration basis, and the overhead of two matrix products rather than one.
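A quick parameter count shows when a stand-alone factorisation $\mathbf{U}\mathbf{V}^\top$ actually stores fewer numbers than the dense matrix. The sketch below is illustrative arithmetic only; the helper names are ours and are not part of the runtime.

```python
def dense_params(m: int, n: int) -> int:
    """Parameters in a dense m x n weight matrix."""
    return m * n


def factored_params(m: int, n: int, k: int) -> int:
    """Parameters in U (m x k) plus V (n x k)."""
    return k * (m + n)


def break_even_rank(m: int, n: int) -> float:
    """Rank at which the factorisation stores as many numbers as the dense matrix."""
    return m * n / (m + n)


d = 4096  # d_model for Llama-3.1-8B, as stated in this report
for k in (1024, 1536, 2048):
    ratio = factored_params(d, d, k) / dense_params(d, d)
    print(f"k={k}: factored/dense = {ratio:.2f}")
print("break-even rank:", break_even_rank(d, d))
```

For a square $4096 \times 4096$ matrix the break-even rank is 2048, so both ranks studied in this report sit below it. GRC additionally shares one projection across Q, K and V, which this per-matrix count does not model.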

GRC addresses these challenges by: (a) deriving the projection basis solely from weight geometry, with no calibration data; (b) applying projection only to the attention Q/K/V weights, where the low-rank structure is strongest; and (c) caching the projection matrices on disk so the one-time computation cost is amortised over all subsequent runs. The empirical result is near-baseline decode throughput (97.55%) at $k/d = 0.375$ and, less expectedly, super-baseline throughput (106.27%) at $k/d = 0.25$. We attribute the latter to a GPU L2-cache-fit effect, with the caveats discussed in §7.

§3, Background

Background: Transformers, Attention, and Memory Bandwidth

3.1 Transformer Decoder Architecture

A transformer decoder with $L$ layers processes a sequence of $T$ tokens. Each layer $\ell$ consists of a multi-head self-attention block followed by a feed-forward network (FFN). During autoregressive decode, for each new token the model reads all $L$ layers' weights once, computing:

\text{Attn}(\mathbf{x}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_h}}\right)\mathbf{V},\quad \mathbf{Q} = \mathbf{W}_Q\mathbf{x},\;\mathbf{K} = \mathbf{W}_K\mathbf{x},\;\mathbf{V} = \mathbf{W}_V\mathbf{x}

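where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ for multi-head attention (packed form), $d_h = d_{\text{model}} / n_{\text{heads}}$ is the per-head dimension, and $\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$ is the residual stream. For Llama-3.1-8B: $d_{\text{model}} = 4096$, $n_{\text{heads}} = 32$, $d_h = 128$. Llama-3.1-8B uses grouped-query attention (GQA) with 8 KV groups, so $\mathbf{W}_K$ and $\mathbf{W}_V$ have only $d_{kv} = 8 \times 128 = 1024$ output rows, while $\mathbf{W}_Q$ remains $d_{\text{model}} \times d_{\text{model}}$.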

Intuition: what is attention actually doing?

Think of a sentence: "The bank by the river was steep." To understand "bank," the model needs to "attend to" (look at) "river" to resolve its meaning (a riverbank, not a financial institution). The Q, K, V matrices are three learned projections that implement this: Q ("query") is what I'm looking for, K ("key") is what each word offers, V ("value") is what gets returned if there's a match. The dot product $\mathbf{Q}\mathbf{K}^\top$ computes a similarity score between every pair of positions, and the softmax turns scores into weights that sum to 1.
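The query/key/value description can be made concrete with a single-query toy example. This is a minimal NumPy sketch of scaled dot-product attention for one decode step and one head; the shapes and random inputs are placeholders chosen for illustration, not model data.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def attend(q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """One query against T cached keys/values (single head).

    q: (d_h,), K: (T, d_h), V: (T, d_h).
    Returns the attention weights (T,) and the mixed value vector (d_h,).
    """
    d_h = q.shape[0]
    scores = K @ q / np.sqrt(d_h)   # similarity of the query to each position
    weights = softmax(scores)       # non-negative, sums to 1
    return weights, weights @ V


rng = np.random.default_rng(0)
T, d_h = 7, 128                     # 7 cached tokens, head dimension 128
q = rng.standard_normal(d_h)
K = rng.standard_normal((T, d_h))
V = rng.standard_normal((T, d_h))
weights, out = attend(q, K, V)
print(weights.round(3), out.shape)
```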

3.2 Why Decode Throughput Is Memory-Bandwidth Limited

Consider a single decode step on an 8B-parameter model stored in Q4_K_M format (~4.9 GB). The GPU must read approximately 4.9 GB of weight data to generate one token. The RTX 4070 Laptop GPU (AD106, 128-bit bus, 16 Gbps GDDR6) has a theoretical peak DRAM bandwidth of 256 GB/s^[12] (128 bits $\times$ 16 Gbps $/$ 8). This is the laptop figure, not the 336 GB/s number, and it is well below the desktop RTX 4070, whose 192-bit bus at 21 Gbps works out to 504 GB/s. With this corrected number:

t_{\text{token}} \;=\; \frac{4.9\ \text{GB}}{256\ \text{GB/s}} \;\approx\; 19.1\ \text{ms}

giving a theoretical decode ceiling of $\sim 52$ tok/s. We measure 35--36 tok/s, which means the implementation reaches roughly 67--70% of theoretical peak bandwidth, consistent with a well-tuned GEMV kernel on a memory-bound workload ^[10]^[11]. Compute utilisation derived from FLOPs/(peak FP16 throughput) is on the order of 1.5%, confirming the workload spends almost all of its time waiting on DRAM rather than computing.
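The bandwidth ceiling above is simple enough to recompute. The sketch below reproduces the arithmetic from the numbers quoted in this section; the function names are ours.

```python
def peak_bandwidth_gb_s(bus_bits: int, gbps_per_pin: float) -> float:
    """Theoretical DRAM bandwidth: bus width in bytes times per-pin data rate."""
    return bus_bits / 8 * gbps_per_pin


def decode_ceiling_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every token streams the whole model once."""
    return bandwidth_gb_s / model_gb


bw = peak_bandwidth_gb_s(128, 16)          # 256.0 GB/s for the RTX 4070 Laptop
ceiling = decode_ceiling_tok_s(4.9, bw)    # about 52.2 tok/s
t_token_ms = 1000.0 / ceiling              # about 19.1 ms per token

print(f"bandwidth = {bw:.0f} GB/s, t_token = {t_token_ms:.1f} ms, ceiling = {ceiling:.1f} tok/s")
for measured in (35.0, 36.0):
    print(f"{measured:.0f} tok/s is {measured / ceiling:.0%} of the ceiling")
```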

A direct consequence: any technique that reduces effective memory reads per token translates proportionally into throughput. Low-rank projection reduces the size of the attention weight matrices that have to be streamed from DRAM; if the projected matrices are small enough to cache-reside, the benefit can be larger than the byte ratio alone suggests.

3.3 Principal Component Analysis and the Gram Matrix

Given a matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$, the Gram matrix is:

\mathbf{G} = \mathbf{W}^\top \mathbf{W} \in \mathbb{R}^{n \times n}

The eigenvectors of $\mathbf{G}$ are the right singular vectors of $\mathbf{W}$ (same as those from SVD: $\mathbf{W} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$). The eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots$ are the squared singular values ($\lambda_i = \sigma_i^2$) and quantify how much variance each direction explains. Retaining the top-$k$ eigenvectors gives a projection matrix $\mathbf{P} \in \mathbb{R}^{n \times k}$ such that $\|\mathbf{W} - \mathbf{W}\mathbf{P}\mathbf{P}^\top\|_F$ is minimised over all rank-$k$ projections.
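Both statements can be checked numerically on a small random matrix. The sketch below uses NumPy; the matrix size and rank are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))   # toy stand-in for a weight matrix
k = 8

# Eigendecomposition of the Gram matrix (eigh returns ascending eigenvalues).
G = W.T @ W
evals, evecs = np.linalg.eigh(G)
order = np.argsort(evals)[::-1]     # reorder to descending
P = evecs[:, order[:k]]             # top-k eigenvectors, shape (32, k)

# SVD of W for comparison.
s = np.linalg.svd(W, compute_uv=False)

# Projection error versus the Eckart-Young optimum (tail singular values).
err_proj = np.linalg.norm(W - W @ P @ P.T, "fro")
err_opt = np.sqrt(np.sum(s[k:] ** 2))

print(np.allclose(evals[order], s ** 2), np.isclose(err_proj, err_opt))
```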

Intuition: PCA as "finding the main directions"

Imagine 1,000 people's heights and shoe sizes plotted as a cloud of points. Even though it's 2D data, most variation lies along one diagonal direction (tall people have bigger feet). PCA finds that diagonal. One number per person (their position along that diagonal) replaces two numbers with little information loss. GRC does the same thing for 4096-dimensional weight vectors.

§4, Method

Method: Geodesic Runtime Compression

4.1 Scope

GRC compresses only the attention projection weights $\{\mathbf{W}_Q^{(\ell)}, \mathbf{W}_K^{(\ell)}, \mathbf{W}_V^{(\ell)}\}_{\ell=1}^{L}$. The output projection $\mathbf{W}_O^{(\ell)}$ is excluded (flag --axex-skip-o) due to observed quality instability at 8B scale, a known limitation. FFN weights (gate, up, down projections) are left entirely uncompressed.

The rationale for attention-only compression is empirical: attention weight matrices have sharply decaying singular spectra (a small number of large directions and many near-zero ones), making them amenable to low-rank approximation, a structural fact also predicted by recent theory^[15]. FFN matrices behave like associative key/value memories^[14], and their spectra are correspondingly flat; at useful compression ratios the Frobenius reconstruction error for FFN becomes unacceptably large (see §8 and §12.2.4).

4.2 Basis Construction (Offline)

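For each layer $\ell$, given dequantised weight matrices $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$ with $d = d_{\text{model}} = 4096$ input columns for Llama-3.1-8B ($\mathbf{W}_Q \in \mathbb{R}^{d \times d}$; under GQA, $\mathbf{W}_K$ and $\mathbf{W}_V$ have $d_{kv} = 1024$ rows, but each Gram term $\mathbf{W}^\top\mathbf{W}$ is still $d \times d$), compute the combined Gram matrix: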

\mathbf{K}^{(\ell)} = \mathbf{W}_Q^\top \mathbf{W}_Q + \mathbf{W}_K^\top \mathbf{W}_K + \mathbf{W}_V^\top \mathbf{W}_V \in \mathbb{R}^{d \times d} \tag{1}

Apply three iterations of power iteration to improve numerical conditioning of the top eigenvectors, then solve for the eigendecomposition of the normalised Gram matrix:

\hat{\mathbf{K}}^{(\ell)} = \frac{\mathbf{K}^{(\ell)}}{\|\mathbf{K}^{(\ell)}\|_F}, \qquad \hat{\mathbf{K}}^{(\ell)} \mathbf{P}_t^{(\ell)} = \mathbf{P}_t^{(\ell)} \mathbf{\Lambda}^{(\ell)} \tag{2}

Retain the top-$k$ eigenvectors to form the per-layer projection matrix $\mathbf{P}_t^{(\ell)} \in \mathbb{R}^{d \times k}$. Compute and store projected weights:

\mathbf{W}_{Q,\text{proj}}^{(\ell)} = \mathbf{W}_Q^{(\ell)}\,\mathbf{P}_t^{(\ell)} \in \mathbb{R}^{d \times k}, \quad \text{(similarly for K, V)} \tag{3}

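The pair $(\mathbf{P}_t^{(\ell)}, \mathbf{W}_{Q,\text{proj}}^{(\ell)})$ is serialised to a deterministic binary cache keyed by a hash of the model weights and the requested rank $k$. Frobenius normalisation (eq. 2) is critical: without it, the raw-scale Gram matrix's eigenvectors are dominated by the largest-magnitude weights and capture <38% of the activation-space variance in practice. Because dividing by a scalar leaves the eigenvectors unchanged in exact arithmetic, this sensitivity should be read as a numerical effect of the float32 solve rather than a change in the underlying geometry.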
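Equations (1) to (3) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the runtime's implementation: it uses a dense symmetric eigensolver, omits the power-iteration step and the on-disk cache, and the function name `grc_basis` and the toy shapes are ours.

```python
import numpy as np


def grc_basis(W_q: np.ndarray, W_k: np.ndarray, W_v: np.ndarray, k: int):
    """Build a shared rank-k basis from the three attention weight matrices.

    Each W has d input columns; row counts may differ (for example under GQA).
    Returns P of shape (d, k) and the three projected matrices W @ P.
    """
    K = W_q.T @ W_q + W_k.T @ W_k + W_v.T @ W_v      # eq. (1), shape (d, d)
    K_hat = K / np.linalg.norm(K, "fro")              # eq. (2), Frobenius-normalised
    evals, evecs = np.linalg.eigh(K_hat)              # ascending eigenvalues
    top = np.argsort(evals)[::-1][:k]                 # indices of the k largest
    P = evecs[:, top]
    return P, (W_q @ P, W_k @ P, W_v @ P)             # eq. (3)


# Toy shapes: d = 64, with K/V narrower than Q to mimic GQA.
rng = np.random.default_rng(0)
d, d_kv, k = 64, 16, 16
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d_kv, d))
W_v = rng.standard_normal((d_kv, d))
P, (Wq_p, Wk_p, Wv_p) = grc_basis(W_q, W_k, W_v, k)
print(P.shape, Wq_p.shape, Wk_p.shape)
```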

A note on basis determinism across BLAS implementations

Eigendecomposition of a symmetric matrix has a per-eigenvector sign ambiguity, and for repeated or near-repeated eigenvalues a basis ambiguity within the eigenspace. Different BLAS / LAPACK implementations (MKL, OpenBLAS, Accelerate, cuSOLVER) can therefore produce mathematically equivalent but numerically different $\mathbf{P}_t^{(\ell)}$ on the same inputs. To make the cache portable across machines we canonicalise eigenvector signs by forcing the first non-zero entry of each eigenvector to be positive; this removes the sign ambiguity but does not remove the within-eigenspace ambiguity for degenerate eigenvalues. In practice the attention Gram matrix has well-separated top eigenvalues and the canonicalised basis reproduces bit-exactly across the BLAS backends we tested.
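The sign rule described above is easy to state in code. The sketch below is one way to implement "first non-zero entry positive"; the tolerance and the function name are assumptions for illustration.

```python
import numpy as np


def canonicalise_signs(P: np.ndarray, tol: float = 1e-12) -> np.ndarray:
    """Flip each column so its first entry with |value| > tol is positive."""
    P = P.copy()
    for j in range(P.shape[1]):
        col = P[:, j]
        nonzero = np.flatnonzero(np.abs(col) > tol)
        if nonzero.size and col[nonzero[0]] < 0:
            P[:, j] = -col
    return P


# Two bases that differ only by per-column sign flips map to the same result.
rng = np.random.default_rng(0)
P, _ = np.linalg.qr(rng.standard_normal((16, 4)))
flipped = P * np.array([1.0, -1.0, -1.0, 1.0])
same = np.array_equal(canonicalise_signs(P), canonicalise_signs(flipped))
print(same)
```

As the text notes, this removes only the sign ambiguity; columns spanning a degenerate eigenspace can still differ by a rotation.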

Implementation detail: the role of power iteration

The combined weight matrices are stored in Q4_K_M format (4-bit quantisation with mixed 4-bit/6-bit sub-blocks). After dequantisation to float32, numerical noise in low-magnitude eigenvectors can dominate. Three power iterations amplify the top-$k$ components relative to noise, stabilising the basis. Five iterations produced slightly worse results in ablation (eigenvalue conditioning improved but low-energy directions became numerically unstable).
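The report does not spell out how the three power iterations are combined with the eigensolve. One common reading is block (subspace) power iteration followed by a small Rayleigh-Ritz eigenproblem; the sketch below implements that reading as an assumption, with a synthetic test matrix whose spectrum we chose by hand.

```python
import numpy as np


def top_k_subspace(K: np.ndarray, k: int, n_iter: int = 3, seed: int = 0):
    """Approximate the top-k eigenpairs of a symmetric PSD matrix K.

    Block power iteration with QR re-orthonormalisation, then Rayleigh-Ritz.
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((K.shape[0], k)))
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(K @ Q)          # amplify dominant directions
    small = Q.T @ K @ Q                     # k x k projected problem
    evals, evecs = np.linalg.eigh(small)
    order = np.argsort(evals)[::-1]
    return Q @ evecs[:, order], evals[order]


# Synthetic matrix with four dominant eigenvalues and a small noisy tail.
rng = np.random.default_rng(1)
d, k = 40, 4
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
lam = np.concatenate([[100.0, 80.0, 60.0, 50.0], rng.uniform(0.0, 1.0, d - k)])
K = (V * lam) @ V.T

P, ritz = top_k_subspace(K, k)
true_projector = V[:, :k] @ V[:, :k].T
gap = np.linalg.norm(P @ P.T - true_projector)
print(ritz.round(3), gap)
```

With a well-separated spectrum, three iterations already recover the dominant subspace closely; how this behaves on dequantised Q4_K_M weights is an empirical question that the sketch does not answer.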

4.3 Runtime Inference Transform

At decode time, each attention layer replaces the three standard Q/K/V GEMVs with a two-step projected computation. Given the residual stream vector $\mathbf{x} \in \mathbb{R}^d$:

\tilde{\mathbf{x}} \;=\; (\mathbf{P}_t^{(\ell)})^\top \mathbf{x} \;\in\; \mathbb{R}^k

\hat{\mathbf{q}} \;=\; \mathbf{W}_{Q,\text{proj}}^{(\ell)}\,\tilde{\mathbf{x}} \;\in\; \mathbb{R}^d \quad (\text{similarly for K, V})

The projection $\tilde{\mathbf{x}}$ (cost $O(dk)$) is shared across Q, K, V in each layer: computed once, reused three times. The multiply-accumulate count for attention projections per token per layer drops from $3d^2$ (full rank) to $4dk$ (GRC: one shared $dk$ projection plus three $dk$ multiplies). For $d=4096$, $k=1536$: from $\sim 50$M to $\sim 25$M, roughly $2\times$ fewer.
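The two-step transform and its operation count can be sketched as follows. This is a NumPy illustration with toy shapes, not the CUDA path used by the runtime; the function names are ours, and the counts treat all three matrices as $d \times d$, as the text does.

```python
import numpy as np


def grc_qkv(x, P, Wq_proj, Wk_proj, Wv_proj):
    """Shared projection followed by three smaller matrix-vector products."""
    x_tilde = P.T @ x                       # shape (k,), computed once per layer
    return Wq_proj @ x_tilde, Wk_proj @ x_tilde, Wv_proj @ x_tilde


def macs_full(d: int) -> int:
    """Multiply-accumulates for three dense d x d GEMVs."""
    return 3 * d * d


def macs_grc(d: int, k: int) -> int:
    """One shared d x k projection plus three d x k GEMVs."""
    return d * k + 3 * d * k


# Toy check: the projected path equals W @ (P P^T x).
rng = np.random.default_rng(0)
d, k = 32, 8
P, _ = np.linalg.qr(rng.standard_normal((d, k)))     # orthonormal columns
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal(d)
q_hat, k_hat, v_hat = grc_qkv(x, P, Wq @ P, Wk @ P, Wv @ P)

print(np.allclose(q_hat, Wq @ (P @ (P.T @ x))))
print(macs_full(4096), macs_grc(4096, 1536))
```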

Crucially, however, the workload is memory-bandwidth limited, not FLOP-limited. The relevant quantity is bytes-loaded, not FLOP-count. At $k=1536$, total projected weight data per layer is:

B_{\text{GRC}} = d \times k \times 4\,\text{bytes} \times 3 = 4096 \times 1536 \times 4 \times 3 \approx 75\text{ MB per layer}

versus baseline Q4_K_M attention weights per layer:

B_{\text{base}} = d^2 \times 0.5\,\text{bytes} \times 3 = 4096^2 \times 0.5 \times 3 \approx 25\text{ MB per layer}

At $k=1536$, GRC loads roughly $3\times$ the bytes of the quantised baseline per layer, which is consistent with the slight throughput reduction to 97.55%. At $k=1024$:

B_{\text{GRC},k=1024} = 4096 \times 1024 \times 4 \times 3 \approx 50\text{ MB per layer}

Now GRC loads roughly $2\times$ the bytes of the quantised baseline per layer, yet measured decode throughput is 106.27% of baseline. This is the central anomaly the rest of the report has to explain. We are not comparing identical kernels here: $\mathbf{W}_{\text{proj}}$ is stored as fp32 with no per-block scales, while baseline weights are Q4_K_M (super-block dequantisation in-kernel)^[13]. A genuinely apples-to-apples comparison would store $\mathbf{W}_{\text{proj}}$ in Q8_0 or fp16 and re-measure; we have not done that yet, and the super-baseline result therefore reflects both low-rank benefit and Q4_K_M format overhead. We discuss this in §7.3 / §12.3.
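The per-layer byte counts used in this subsection follow from two small formulas. The sketch below recomputes them and adds a hypothetical fp16 variant of $\mathbf{W}_{\text{proj}}$ for comparison; the fp16 row is not something this report measured, and the 0.5 bytes/weight figure for Q4_K_M is the approximation used in the text.

```python
def grc_bytes_per_layer(d: int, k: int, bytes_per_weight: float = 4.0, n_slots: int = 3) -> float:
    """Projected Q/K/V weights, each d x k, at the given storage width."""
    return d * k * bytes_per_weight * n_slots


def base_bytes_per_layer(d: int, bytes_per_weight: float = 0.5, n_slots: int = 3) -> float:
    """Dense Q/K/V weights, each d x d, at roughly 4 bits per weight."""
    return d * d * bytes_per_weight * n_slots


d = 4096
base = base_bytes_per_layer(d)
for k in (1536, 1024):
    fp32 = grc_bytes_per_layer(d, k)
    fp16 = grc_bytes_per_layer(d, k, bytes_per_weight=2.0)   # hypothetical storage format
    print(f"k={k}: fp32 {fp32 / 1e6:.1f} MB ({fp32 / base:.1f}x base), "
          f"fp16 {fp16 / 1e6:.1f} MB ({fp16 / base:.1f}x base)")
print(f"baseline: {base / 1e6:.1f} MB")
```

Under these assumptions an fp16 cache at $k=1024$ would load the same number of bytes per layer as the Q4_K_M baseline, which is why that follow-up measurement would be informative.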

4.4 Batch-Prefill Constraint

After the W_proj cache is built, raw Q/K/V weight tensors are freed from VRAM to stay within the 8 GB budget. The projected matrices, which partially displace the originals, take $1536 \times 4096 \times 3 \times 32 \times 4\text{ B} \approx 2.42\text{ GB}$ if all three slots are stored as fp32 $d \times k$ (about 1.21 GB if K and V use their GQA shape $d_{kv} \times k$ with $d_{kv} = 1024$). The forward pass for prefill (processing the prompt in a batch) requires the raw weights for efficient batched GEMM; without them, prefill falls back to sequential token-by-token processing, adding 8--15% overhead. This is an implementation constraint, not fundamental to the method.

§5, Experimental Setup

Experimental Setup

5.1 Hardware

Component	Specification
GPU	NVIDIA GeForce RTX 4070 Laptop GPU (Ada Lovelace, sm_89)
GPU VRAM	8,188 MiB GDDR6
GPU DRAM bandwidth	256 GB/s theoretical (RTX 4070 Laptop, 128-bit × 16 Gbps)
GPU L2 cache	32 MB
GPU FP32 peak	40 TFLOPS (theoretical)
GPU TDP (observed decode)	103--109 W
GPU driver	595.79
CPU	AMD Ryzen 9 7940HS, 8c/16t, 4.0 GHz base, 5.2 GHz boost
System RAM	32 GB DDR5-5200 (2×16 GB Kingston)
Storage	2× Kingston SNV2S 2 TB NVMe SSD
OS	Windows 11, CUDA host-mode runtime

5.2 Model

Property	Value
Model	Meta-Llama-3.1-8B-Instruct
Quantisation	Q4_K_M (GGUF v3)
File size	4.583 GiB (4,920,739,232 bytes, about 4.92 GB)
Architecture	LLaMA, 32 layers, $d=4096$, 32 heads (8 KV groups GQA), $d_h=128$
FFN intermediate dim	14,336
Parameters	8,310 M
Vocab size	128,256 tokens (BPE)

5.3 Measurement Protocol

All throughput measurements follow a locked protocol to prevent GPU thermal throttling from confounding results. Without cooldowns, the GPU clocks down from ~2235 MHz to ~800--1400 MHz after sustained load, producing artificially low throughput readings (as low as 53% of true baseline in early experiments, a measurement artefact, not a real effect).

30-second GPU cooldown between every measurement run
Rank sweep runs first (while GPU is at thermal equilibrium)
CI measurements: 12 repetitions per configuration
PPL measurements: 5 repetitions (deterministic, all identical)
Starting GPU temperature: 38--41°C before each run
GPU temperature during sustained decode: 59--61°C

The rank sweep uses 8 distinct prompt cases spanning short/medium/long lengths and coding/reasoning classes. All figures in §6 are means across these 8 cases. The W_proj cache was pre-computed and verified by hash before all measurements; no first-run basis-construction overhead is included in throughput figures.
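The cooldown rule in the protocol can be expressed as a small harness. This is a hypothetical sketch, not the script used for the measurements: `run_once` stands for whatever launches one benchmark run and returns tokens per second, and the readings in the demo are made-up placeholders, not results.

```python
import statistics
import time
from typing import Callable, List


def measure_with_cooldown(
    run_once: Callable[[], float],
    n_reps: int,
    cooldown_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Run n_reps measurements with a fixed cooldown between consecutive runs."""
    samples = []
    for rep in range(n_reps):
        if rep > 0:
            sleep(cooldown_s)            # let the GPU return to its idle temperature
        samples.append(run_once())
    return samples


def summarise(samples: List[float]):
    """Mean and sample standard deviation of the collected readings."""
    return statistics.mean(samples), statistics.stdev(samples)


# Demo with placeholder readings and a recording stub instead of real sleeping.
placeholder_readings = iter([35.1, 35.4, 34.9])
naps = []
samples = measure_with_cooldown(lambda: next(placeholder_readings), n_reps=3, sleep=naps.append)
mean, sd = summarise(samples)
print(samples, naps, round(mean, 2))
```

Injecting `sleep` keeps the harness testable without waiting; a real run would leave the default and would also log GPU temperature and clock before each repetition.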

§6, Results

Results

6.1 Throughput: Rank Sweep

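The following table reports mean decode throughput, overall throughput, and prefill, each as a percentage of the uncompressed Q4_K_M baseline. For decode and overall, higher is better. The prefill column is read as relative prefill time (higher is slower), consistent with the 8--15% prefill overhead described in §4.4 and the ≤225% prefill gate in §6.6. All measurements use the locked 30-second cooldown protocol.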

Rank $k$	$k/d$	Decode (% baseline)	Overall (% baseline)	Prefill (% baseline)
1024	0.25	106.27%	105.72%	102.67%
1536	0.375	97.55%	95.80%	114.61%
2048†	0.50†	101.04%	99.34%	108.48%
Baseline	1.0	100%	100%	100%

† k=2048 request is silently capped to k=1536 by AXEX_MANIFOLD_K_MAX=1536 in runtime/nn/axiom_exploit.h. The k=2048 row reflects cache warm-up behavioural differences, not true k=2048 projection geometry. Decode throughput baseline: 35--36 tok/s at 2,235 MHz GPU boost clock.

6.2 Confidence Intervals (12-rep Sustained Load)

Prompt class	Baseline decode	GRC k=1536 decode	Mean retention	Lower-95% bound
coding/256	35.68 ± 0.35 tok/s	34.86 ± 2.02 tok/s	97.70%	86.60%
reasoning/256	35.58 ± 0.31 tok/s	35.22 ± 2.42 tok/s	98.99%	85.64%

GRC throughput spread is approximately 6× higher than baseline (standard deviation $\sigma \approx 6\%$ of the mean vs $\sigma \approx 1\%$). This reflects sensitivity to GPU clock state and L2 cache residency patterns that vary across prompt-induced memory access sequences. The worst-case lower-95% confidence bound of 85.64% is well above the 67% gate threshold.
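The retention and relative-spread figures can be recomputed from the table. The sketch below assumes the "±" values are standard deviations, which the table does not state explicitly; it does not attempt to reproduce the lower-95% bounds, whose exact construction is not given here.

```python
# (mean tok/s, spread tok/s) copied from the table above.
rows = {
    "coding/256": {"base": (35.68, 0.35), "grc": (34.86, 2.02)},
    "reasoning/256": {"base": (35.58, 0.31), "grc": (35.22, 2.42)},
}


def retention_pct(base_mean: float, grc_mean: float) -> float:
    """GRC mean throughput as a percentage of the baseline mean."""
    return grc_mean / base_mean * 100.0


def rel_spread_pct(mean: float, spread: float) -> float:
    """Spread expressed as a percentage of the mean."""
    return spread / mean * 100.0


summary = {}
for name, r in rows.items():
    ret = retention_pct(r["base"][0], r["grc"][0])
    base_cv = rel_spread_pct(*r["base"])
    grc_cv = rel_spread_pct(*r["grc"])
    summary[name] = (ret, base_cv, grc_cv)
    print(f"{name}: retention {ret:.2f}%, spread {base_cv:.2f}% -> {grc_cv:.2f}% "
          f"({grc_cv / base_cv:.1f}x)")
```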

6.3 Quality: Perplexity

WikiText-2 perplexity, evaluated with 512-token context windows at temperature=0 (greedy decoding). Measurements are fully deterministic, identical values across all 5 runs.

Configuration	PPL	vs Baseline	Cache hash
Baseline (Q4_K_M, no GRC)	6.7902	-	-
GRC k=1024	10.9585	+61.39%	measured 2026-04-22
GRC k=1536	7.6936	+13.30%	2405A3B6
GRC k=2048 (duplicate of k=1536, see footnote)	7.6936	+13.30%	2405A3B6 (same)
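The "vs Baseline" column is the relative change in perplexity. A short recomputation from the table values, with names of our own choosing:

```python
BASELINE_PPL = 6.7902


def ppl_delta_pct(ppl: float, baseline: float = BASELINE_PPL) -> float:
    """Relative perplexity change versus baseline, in percent."""
    return (ppl / baseline - 1.0) * 100.0


measured = {"GRC k=1024": 10.9585, "GRC k=1536": 7.6936}
deltas = {name: ppl_delta_pct(ppl) for name, ppl in measured.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```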

Quality context, with caveats

A +13.30% perplexity increase sits in the same ballpark as published numbers for related compression schemes on similar-scale Llama models, though direct head-to-head comparisons on identical hardware were not run in this cycle. For rough orientation, the literature reports approximate WikiText-2 PPL deltas relative to fp16 of:

GPTQ 4-bit on Llama-7B^[5]: ~+1--3%.
AWQ w4-g128 on Llama-7B^[6]: ~+1--2%.
SliceGPT 25--30% slicing on Llama-2-7B^[2]: ~+5--9% on WikiText-2 (calibration-based).
ASVD 20% rank reduction on Llama-7B^[3]: ~+10--15% (activation-aware, calibration-based).
Q4_K_M alone vs fp16 (the baseline GRC sits on)^[13]: ~+1--3%.

So GRC at $k=1536$ adds +13.30% on top of Q4_K_M, or roughly +14--17% vs fp16 once the Q4_K_M delta is compounded. That is at or slightly above the upper end of ASVD's published range, despite using no calibration data. PPL is a distribution-level metric; its relationship to task performance is non-linear. Task-level evaluations (MMLU, HumanEval, TruthfulQA) were not performed in this cycle. At $k=1024$, the throughput-optimal setting, perplexity rises by +61.39% (see the table above). We flag this prominently because the headline 106.27% throughput number comes with a severe quality cost attached.

6.4 VRAM Profile

Stage	Baseline	GRC k=1536	Delta
OS/display idle	~1,136 MiB	~1,136 MiB	-
Post-model load	~5,812 MiB	~5,812 MiB	-
Active decode (sustained)	6,695 MiB	6,702--6,731 MiB	+7 to +36 MiB
Peak observed	6,695 MiB	6,731 MiB	+36 MiB
Headroom (8,188 MiB total)	~1,493 MiB	~1,457 MiB	-

6.5 Power Draw

Phase	Baseline GPU power	GRC GPU power
Idle	1.9 W	2.3 W
Model loading	15.8 W	15.9 W
PCA basis construction (first run only)	-	13--14 W (CPU-bound)
Decode (sustained)	103--109 W	103--109 W

During active decode, both configurations draw identical GPU power. The GPU remains memory-bandwidth saturated at full TDP regardless of rank. GRC provides no power efficiency advantage in this configuration.

6.6 Validation Gate Summary

Gate	Threshold	Measured	Result
k=1024 decode	≥95%	106.27%	PASS
k=1536 decode	≥75%	97.55%	PASS
k=2048 decode	≥75%	101.04%	PASS
k=2048 prefill	≤225%	108.48%	PASS
coding lower-95	≥67%	86.60%	PASS
reasoning lower-95	≥67%	85.64%	PASS
PPL delta	≤+15%	+13.30%	PASS

§7, A Working Hypothesis: Cache-Fit

Why the Compressed Model Runs Faster (We Think)

The most surprising finding in this report is that GRC at $k=1024$ measures 106.27% of baseline decode throughput. The result is statistically robust ($p \approx 10^{-10}$ across 8 paired runs, §9) and survives the locked thermal protocol. The rest of this section separates what we know about the mechanism from what is still hypothesis.

7.1 The Puzzle

At $k=1024$, the projected weight matrices are larger in raw bytes than the Q4_K_M originals (50 MB vs 25 MB per layer for attention Q/K/V). The GRC path also requires an extra projection step. Naively, GRC should be slower. It isn't. So either (a) the cost of Q4_K_M dequantisation is higher than its byte count suggests, or (b) the GRC path benefits from the GPU memory hierarchy in a way the byte count doesn't capture, or (c) both.

7.2 The Cache-Fit Hypothesis

The RTX 4070 Laptop has a 32 MB L2 cache. Per-layer attention weight footprints:

B_{\text{GRC}}^{(k=1024)} \;=\; 3 \times d \times k \times 4\,\text{B} \;\approx\; 50\ \text{MB}

B_{\text{Q4\_K\_M}} \;=\; 3 \times d^2 \times 0.5\,\text{B} \;\approx\; 25\ \text{MB}

Per-layer, neither path fits cleanly inside L2. But the access patterns differ. Q4_K_M interleaves 4-bit weights with per-block scale factors and requires in-kernel dequantisation^[13]; the GRC W_proj matrices are stored as contiguous fp32 with stride-1 access. The Ada Lovelace L2 was substantially enlarged over Ampere precisely to keep this kind of contiguous working set resident^[12]. We hypothesise that the 6.27% gap comes from higher effective cache-line utilisation on the contiguous fp32 path, plus the avoided cost of in-kernel dequantisation.
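For orientation, the two footprints can be put next to the L2 capacity directly. The sketch below is back-of-envelope arithmetic with two assumptions of ours: the 32 MB L2 is read as 32 MiB, and all three matrices are treated as dense $d$-row matrices as in the formulas above.

```python
L2_BYTES = 32 * 2**20   # assumption: 32 MB interpreted as 32 MiB


def qkv_footprint_bytes(d: int, cols: int, bytes_per_weight: float) -> float:
    """Three d x cols matrices at the given storage width."""
    return 3 * d * cols * bytes_per_weight


d = 4096
grc_k1024 = qkv_footprint_bytes(d, 1024, 4.0)    # fp32 projected weights
q4_base = qkv_footprint_bytes(d, d, 0.5)          # Q4_K_M at roughly 0.5 B/weight

# Largest fp32 rank whose three projected matrices would fit in L2 on their own.
max_fp32_rank_in_l2 = L2_BYTES // (3 * d * 4)

print(f"GRC k=1024 / L2  = {grc_k1024 / L2_BYTES:.2f}")
print(f"Q4_K_M base / L2 = {q4_base / L2_BYTES:.2f}")
print(f"largest fp32 rank fitting in L2 = {max_fp32_rank_in_l2}")
```

These ratios ignore everything else competing for L2 during a layer (activations, KV cache reads, FFN weights), so they bound residency from above rather than predict it. They are one more reason the hardware-counter measurement matters.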

What is hypothesis vs measurement

We do not have an Nsight Compute trace of $\texttt{l2\_tex\_hit\_rate}$, $\texttt{dram\_\_bytes\_read.sum}$, or sector-level utilisation for the two paths. Without those counters the cache-fit story is consistent with our timing data but not directly verified at the microarchitecture level. Reasonable alternative explanations include register-pressure relief, scheduler-occupancy effects, or the avoided Q4_K_M dequantisation arithmetic itself. We mark this clearly in the Limitations table (§12.3) and treat it as the single highest-priority open verification.

7.3 The fp32-vs-Q4_K_M Caveat

There is a second concern. The current $\mathbf{W}_{\text{proj}}$ is stored as fp32 with no per-block scales, while the baseline path uses Q4_K_M super-blocks^[13]. Even at $k=1024$ this means the GRC path reads $\sim 2\times$ as many bytes per layer as baseline yet still wins on wall-clock time. That the comparison is not byte-for-byte is the most striking part of the result; it strongly suggests the headline 106.27% partly reflects format overhead in Q4_K_M and not pure low-rank benefit. The fairer experiment is to store $\mathbf{W}_{\text{proj}}$ in Q8_0 or fp16 and re-measure. We have not done that yet, and we recommend it as the most informative single follow-up.

Plain version

A larger book that lives on the desk is faster to consult than a smaller book scattered across ten shelves with index cards in between. Q4_K_M is the smaller-book-with-index-cards case (4-bit blocks plus scales, decoded on the fly). The fp32 GRC weights are bigger but come in one continuous run. This story fits the timings; we just can't yet show hardware counters that prove it.

7.4 Implications (carefully stated)

If the cache-fit story holds up under direct measurement, it would suggest that for bandwidth-limited GEMV decode workloads, optimal throughput sits at a hardware-specific rank rather than at full precision. That is a surprising and useful design knob. We deliberately do not claim it as established fact in this report. Different GPU microarchitectures have different L2 sizes and bandwidth ratios, and the predictions below are derived analytically; they need empirical confirmation:

GPU	L2 cache	DRAM BW	Predicted optimal k/d
RTX 4070 Laptop (tested)	32 MB	256 GB/s	~0.25 (empirically observed)
RTX 4090	72 MB	1008 GB/s	~0.35--0.40 (predicted)
A100 SXM	40 MB	2000 GB/s (HBM)	~0.20--0.25 (predicted)
H100 SXM	50 MB	3350 GB/s (HBM3)	~0.20--0.30 (predicted)

Cross-hardware validation is the primary open experimental question. The predictions above are derived from the ratio of L2 cache size to model attention weight footprint.

§8, Spectral Justification of Low-Rank Compression

Why Attention Compresses but FFN Does Not

A central premise of GRC is that attention weight matrices have rapidly-decaying singular spectra (most of their Frobenius energy lies in a small fraction of singular directions), while feed-forward (FFN) matrices do not. This section verifies that premise empirically by computing the full SVD of every attention and FFN weight matrix in five layers of Llama-3.1-8B-Instruct (Q4_K_M, dequantised to f32) and reports the rank required to capture a target fraction of $\|\mathbf{W}\|_F^2$.

Figure 8.1. Normalised singular value spectra (log scale) for the five sampled layers. Vertical guides mark $k{=}1024$ and $k{=}1536$. Attention spectra fall $\sim$3 orders of magnitude over the first 1,024 components; FFN spectra remain within $\sim$1 order of magnitude.

Figure 8.2. Cumulative fraction of $\|\mathbf{W}\|_F^2$ captured by the top-$k$ singular components, all seven slots, all five sampled layers. Attention slots reach $\geq 95\%$ energy by $k \approx 635$–2,342; FFN slots require $k \geq 3{,}199$, close to full rank. Source: docs/figures/spectra_summary.json.

Figure 8.3. Per-layer rank required to capture 95% of Frobenius energy across layers $\{0, 7, 15, 23, 31\}$. Attention $\mathbf{W}_Q$ averages $k_{95} \approx 1{,}682$ ($k/d \approx 0.41$); FFN $\mathbf{W}_{\text{down}}$ averages $k_{95} \approx 3{,}345$ ($k/d \approx 0.82$). This $\sim 2\times$ gap is the structural reason GRC compresses attention well and why we deliberately leave FFN at full rank.

8.1 Quantitative summary

Across layers $L \in \{0, 7, 15, 23, 31\}$, the rank required to capture 95% of weight energy is:

Matrix	$k_{95}$ range	Mean $k/d$	Relative to GRC $k=1024$
$\mathbf{W}_Q$ (attention)	635 – 2155	0.41	Exceeds $k=1024$ on average (mean $k_{95} \approx 1{,}682$)
$\mathbf{W}_K$ (attention)	253 – 724	0.15	Well within target rank (GQA: $d_\text{kv}{=}1024$)
$\mathbf{W}_V$ (attention)	783 – 835	0.20	Within target rank (GQA: $d_\text{kv}{=}1024$)
$\mathbf{W}_O$ (attention)	1947 – 2342	0.52	Marginal at $k=1024$
FFN $\mathbf{W}_{\text{gate}}$	3199 – 3304	0.80	Far exceeds GRC rank
FFN $\mathbf{W}_{\text{up}}$	3304 – 3408	0.82	Far exceeds GRC rank
FFN $\mathbf{W}_{\text{down}}$	3293 – 3407	0.82	Far exceeds GRC rank

This empirically justifies the attention-only compression policy. The $\mathbf{W}_O$ marginal status at $k=1024$ also provides an independent explanation for the early instabilities we observed when compressing $\mathbf{W}_O$ (cf. §12.2: O_proj excluded).
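The $k_{95}$ statistic reported in this section can be sketched as follows. This is a minimal NumPy illustration on toy matrices, not the script that produced docs/figures/spectra_summary.json; the function name `rank_for_energy` is hypothetical.

```python
import numpy as np


def rank_for_energy(weight: np.ndarray, target: float = 0.95) -> int:
    """Smallest k whose top-k singular values capture `target` of ||W||_F^2."""
    sigma = np.linalg.svd(weight, compute_uv=False)  # returned in descending order
    energy = np.cumsum(sigma**2) / np.sum(sigma**2)  # cumulative energy fraction
    k = int(np.searchsorted(energy, target)) + 1     # first index reaching the target
    return min(k, sigma.size)                        # guard against rounding at target=1.0


# Decaying spectrum: energies 16, 4, 1, 1, so 21/22 of the energy is reached at k = 3.
decaying = np.diag([4.0, 2.0, 1.0, 1.0])
# Flat spectrum: every direction carries equal energy, so k = 4 (full rank) is needed.
flat = np.eye(4)

print(rank_for_energy(decaying), rank_for_energy(flat))
```

The two toy cases mirror the contrast in the table: a decaying spectrum reaches 95% energy well before full rank, while a flat one does not.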

§9, Statistical Significance of the Super-Baseline

Hypothesis Tests on Throughput Gains

The headline claim is that GRC at $k=1024$ exceeds uncompressed baseline decode throughput. To rule out a small-sample artefact, we apply three independent statistical tests on the paired baseline / GRC throughput measurements:

One-sided paired Student's $t$-test ($H_0$: ratio $\leq 1$).
One-sided Wilcoxon signed-rank test (non-parametric).
Bootstrap 95% confidence interval for the throughput ratio ($10^4$ resamples, seed=42).

Source data: benchmarks/whitepaper_pack_20260427_121815/rank_sweep_relative_to_baseline.csv and ci_pack_raw.csv. Full numerical output in docs/figures/statistical_tests.json.

Figure 9.1. Paired baseline (left) vs GRC $k=1024$ (right) decode throughput across 8 thermally-controlled runs. Every paired sample shows GRC > baseline.

9.1 Test results

Configuration	$n$	Mean ratio	Bootstrap 95% CI	$t$-stat	$p$-value	Verdict
k=1024 decode (super-baseline)	8	1.0627	[1.0607, 1.0650]	53.878	9.945 × 10⁻¹¹	$H_0$ rejected
k=1536 decode (near-lossless)	8	0.9755	[0.9071, 1.0232]	−1.21	0.4814	Indistinguishable from baseline
CI pack: coding 256-token	5	0.9767	n/a	−0.92	0.4173	No significant change
CI pack: reasoning 256-token	5	0.9897	n/a	−0.31	0.7773	No significant change

Statistical conclusion

The $k=1024$ super-baseline is not a small-sample artefact. With $t = 53.88$, $p \approx 10^{-10}$, and a bootstrap 95% CI of [1.0607, 1.0650] that excludes 1.0 by a margin much larger than its width, we reject $H_0:$ ratio $\leq 1$ at any conventional significance level. The Wilcoxon signed-rank test concurs ($p < 0.01$, all 8 paired samples agree in sign).

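The $k=1536$ result (ratio 0.9755, CI [0.9071, 1.0232]) cannot be distinguished from baseline at $\alpha=0.05$. This is consistent with the near-lossless throughput claim but does not by itself establish equivalence: with $n=8$ the interval is wide and admits anything from a 9.3% slowdown to a 2.3% speedup. A formal equivalence claim would need an equivalence test (e.g. TOST against a stated margin) or more runs.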
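The three tests listed in this section can be sketched with the Python standard library alone. The ratios below are placeholder values chosen for illustration, not the raw data in ci_pack_raw.csv. The sketch assumes the tests are applied to the per-run quantity (ratio $- 1$) and that there are no ties or zeros, and all function names are hypothetical.

```python
import itertools
import math
import random
import statistics

# Placeholder per-run GRC/baseline throughput ratios (NOT the report's raw data).
ratios = [1.0610, 1.0655, 1.0622, 1.0641, 1.0598, 1.0649, 1.0630, 1.0611]
diffs = [r - 1.0 for r in ratios]


def paired_t(d):
    """t statistic for H0: mean difference <= 0 (n - 1 degrees of freedom)."""
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def wilcoxon_exact_one_sided(d):
    """Exact one-sided signed-rank p-value by enumerating all 2^n sign patterns."""
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    hits = sum(
        1
        for signs in itertools.product((0, 1), repeat=n)
        if sum(r for r, s in zip(range(1, n + 1), signs) if s) >= w_plus
    )
    return hits / 2**n


def bootstrap_ci(values, resamples=10_000, seed=42, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(statistics.mean(rng.choices(values, k=n)) for _ in range(resamples))
    return means[int(alpha / 2 * resamples)], means[int((1 - alpha / 2) * resamples) - 1]


print("t =", round(paired_t(diffs), 2))
print("Wilcoxon p =", wilcoxon_exact_one_sided(diffs))
print("bootstrap 95% CI =", bootstrap_ci(ratios))
```

With eight paired samples that all agree in sign, the exact one-sided Wilcoxon p-value is $1/256 \approx 0.0039$ whatever the magnitudes. That is the smallest value the test can return at $n=8$ and matches the $p < 0.01$ quoted above.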

§10, Theoretical Bound: Eckart--Young vs GRC

How Far Is the Shared Basis from the Optimum?

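The Eckart--Young--Mirsky theorem gives a tight lower bound on the Frobenius reconstruction error of any approximation $\mathbf{W}_k$ of rank at most $k$, where $\sigma_i(\mathbf{W})$ are the singular values of $\mathbf{W}$ in descending order:

$$\|\mathbf{W} - \mathbf{W}_k\|_F^2 \;\geq\; \sum_{i>k} \sigma_i(\mathbf{W})^2$$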

This bound is achieved by the truncated SVD of $\mathbf{W}$ alone. GRC, however, builds a single shared projection $\mathbf{P}_k$ from the combined Gram matrix $\mathbf{K} = \mathbf{W}_Q^\top\mathbf{W}_Q + \mathbf{W}_K^\top\mathbf{W}_K + \mathbf{W}_V^\top\mathbf{W}_V$, so its per-matrix error must be $\geq$ the Eckart--Young bound. The excess factor $\rho_k(\mathbf{W}) = \|\mathbf{W} - \mathbf{W}\mathbf{P}_k\mathbf{P}_k^\top\|_F^2 / \sum_{i>k}\sigma_i^2$ quantifies the cost of using a shared (calibration-free) basis instead of a per-matrix one.

10.1 Numerical verification (layers 0, 15, 31; ranks 512--2048)

For each (layer, rank, matrix) triple we compute (a) the Eckart--Young rel-F² lower bound and (b) the actual GRC rel-F² error using the same shared projection used by the runtime kernel (3-iteration power-stabilised eigendecomposition of $\mathbf{K}/\|\mathbf{K}\|_F$). Full data: docs/figures/eckart_young_bound.json.

$k$	EY mean rel-F² (oracle)	GRC mean rel-F²	Excess factor $\rho$ (mean across $\mathbf{W}_Q$)
512	0.190	0.471	1.83×
1024	0.042 (Q only; K, V at full rank)	0.305	~3.7× (Q)
1536	0.020	0.204	~9.5× (Q)
2048	0.009	0.151	~28× (Q)
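The comparison behind this table can be sketched on toy matrices. The code below is an illustration under stated assumptions, not the script that produced docs/figures/eckart_young_bound.json. It uses an exact `numpy.linalg.eigh` decomposition instead of the 3-iteration power-stabilised method described above, random Gaussian weights instead of real model weights, and toy dimensions (64, 16, 24) standing in for (4096, 1024, 1536). The function names are hypothetical.

```python
import numpy as np


def shared_basis(weights, k):
    """Top-k eigenvectors of the combined Gram matrix sum_i W_i^T W_i, as columns of P_k."""
    gram = sum(w.T @ w for w in weights)
    _, eigvecs = np.linalg.eigh(gram)      # eigenvalues come back in ascending order
    return eigvecs[:, ::-1][:, :k]         # reverse, then keep the k largest


def rel_errors(w, p_k):
    """Return (Eckart-Young bound, shared-basis error), both relative to ||W||_F^2."""
    k = p_k.shape[1]
    sigma = np.linalg.svd(w, compute_uv=False)
    total = np.sum(sigma**2)
    ey_bound = np.sum(sigma[k:] ** 2) / total
    shared = np.linalg.norm(w - w @ p_k @ p_k.T, "fro") ** 2 / total
    return ey_bound, shared


rng = np.random.default_rng(0)
d, d_kv, k = 64, 16, 24                    # toy stand-ins for 4096, 1024, 1536
w_q = rng.standard_normal((d, d))
w_k = rng.standard_normal((d_kv, d))       # GQA-style: fewer rows than W_Q
w_v = rng.standard_normal((d_kv, d))
p_k = shared_basis([w_q, w_k, w_v], k)

for name, w in (("Q", w_q), ("K", w_k), ("V", w_v)):
    ey_bound, shared = rel_errors(w, p_k)
    rho = shared / ey_bound if ey_bound > 1e-12 else float("inf")
    print(f"W_{name}: EY bound={ey_bound:.3f}  shared={shared:.3f}  rho={rho:.2f}")
```

Because `w_k` and `w_v` have only `d_kv` rows, their Eckart--Young bound is zero once $k \geq d_\text{kv}$, so any error on them comes entirely from sharing the basis and $\rho$ is unbounded there.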

10.2 Interpretation

Note that $\mathbf{W}_K, \mathbf{W}_V$ in Llama-3.1's GQA have rank $\leq 1024$ by construction (shape $1024 \times 4096$), so their Eckart--Young bound is $0$ at $k\geq 1024$; the GRC error there is purely the cost of shared projection.

Two observations:

The shared basis pays a real, quantifiable cost. At $k=512$, GRC sits ~1.8--4.7× above the Eckart--Young oracle (averaged across {Q, K, V}). At larger $k$, the relative gap widens because the EY bound itself drops faster than the shared basis can track.
Despite the gap, perplexity degrades only moderately. The §6 results show 97.55% decode-throughput retention and +13.30% PPL at $k=1536$, within the structural penalty budget. This suggests that the directions missed by the shared basis matter less for next-token prediction than their singular values alone would indicate, although task-level quality has not been measured (§12.3).

What this motivates

The $\sim$3--10× excess factor over Eckart--Young at the operating ranks ($k=1024$ and $k=1536$, §10.1) is the strongest argument for per-matrix bases as future work (§13). A scheme that builds three separate projections $\mathbf{P}_Q, \mathbf{P}_K, \mathbf{P}_V$ would close the gap to the oracle bound at the cost of $3\times$ the projection storage. The fact that the shared basis still keeps the perplexity penalty moderate at $k=1536$ despite the gap indicates that calibration-free, single-basis GRC is near a useful local optimum, not the global one.

§11.5, Impact and Implications

Why This Might Matter (and Why It Might Not)

There is a real risk in research papers of overselling implications. We try to be careful here. The strongest claim this report supports is local: on this hardware, with this model, at this rank, decode is faster than baseline by a measurable and statistically significant margin. The interesting question is whether anything beyond the local fact survives.

11.5.1 If the cache-fit story holds up

Suppose direct hardware-counter measurement (the highest-priority follow-up) confirms that the speedup comes from L2 working-set behaviour. Then a few things would follow:

Attention-rank as a deployment knob. Inference servers could pick the compression rank for each model/GPU pair so that the active attention working set fits cleanly in L2. This is a deployment-time configuration, not a training-time one. It would compose with existing runtimes (vLLM, llama.cpp, TensorRT-LLM) without retraining.
Architecture-level guidance. The cache-fit optimum on this GPU ($k=1024$) is only ~25% of Llama-3.1-8B's $d=4096$ attention dimension. Sizing future attention dimensions with knowledge of the dominant deployment cache hierarchy is a cheap design lever.
A negative result for "always more rank is better." The folk wisdom that bandwidth-bound decode benefits from any reduction in weight bytes is not quite right; locality matters at least as much. That is an unsurprising statement to a hardware engineer and a slightly surprising one to an ML practitioner.

11.5.2 If it doesn't

If counter measurement attributes the speedup to register pressure, scheduler effects, or avoided dequantisation arithmetic rather than L2 fit, the practical recipe (low-rank attention compression at deployment time) still works; it just becomes another instance of "format overhead matters" rather than a cache-architecture story. The calibration-free part remains useful in either case.

11.5.3 Scope of the implications

This report does not claim a new state of the art on any benchmark leaderboard, and the head-to-head comparisons that would be needed to make such a claim (see §12.3) have not been run. What it offers is a clean, reproducible empirical observation, an account of why we think it occurs, and a list of concrete experiments that other groups would be well-placed to run. The cross-hardware sweep, the Nsight Compute counter trace, and the Q8_0/fp16 W_proj re-measurement are the three follow-ups most likely to be informative. Collaboration on any of them is welcome.

§12, Limitations

Limitations

Scope of validation

All results in this paper are from a single GPU (RTX 4070 Laptop) and a single model (Llama-3.1-8B-Instruct Q4_K_M). Cross-hardware and cross-model transfer experiments are in progress (Phase 3) but incomplete. Claims about generality are unsupported by current data.

12.1 What Is and Is Not Demonstrated

Dimension	Status	Evidence
Throughput retention at k=1536 on Llama-3.1-8B	Demonstrated	7 gates, 12-rep CI, locked protocol
Super-baseline at k=1024 on this GPU	Demonstrated	8 paired thermally-controlled runs (§9); cache-fit mechanism hypothesised, not verified (§12.3)
PPL penalty at k=1536 deterministic	Demonstrated	5 identical runs
Calibration-free basis construction	Demonstrated	Zero calibration data used
Cross-hardware generality	Not demonstrated	Single GPU tested
Cross-model generality	Not demonstrated	Phase 3 in progress
Quality at k=1024	Measured (+61.39% PPL)	docs/figures/ppl_sweep/; collapse explained by GQA K/V dim = 1024
Batch inference behaviour	Not demonstrated	Single-request decode only
Long-context quality (4K--8K tokens)	Not demonstrated	512-token eval windows only
Task-level quality (MMLU, HumanEval)	Not demonstrated	Only PPL measured

12.2 Known Technical Limitations

Quality penalty (+13.30% PPL)

Structural and unavoidable at k=1536: it reflects information lost in projection from $d=4096$ to $k=1536$. This stacks on top of the Q4_K_M quantisation penalty already present. Closing the gap requires either higher $k$ (reducing throughput benefit) or fine-tuning.

Prefill overhead (8--15% slower)

When GRC is active, raw Q/K/V tensors are freed from VRAM after W_proj is built. The batch-prefill path requires raw tensors for efficient GEMM, so prefill falls back to sequential token processing. This is an implementation constraint fixable with more VRAM or a split-weight strategy.

`AXEX_MANIFOLD_K_MAX = 1536` hard cap

A compile-time constant silently clamps $k=2048$ requests to $k=1536$. All $k=2048$ results therefore use the same projection as $k=1536$. The cap was a conservative stability guard; removing it requires further testing.
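The effect of the cap on a requested rank can be illustrated in a few lines. This is a Python illustration only: the real cap is a compile-time constant in the runtime, whose source is not shown here, and the helper `effective_rank` is hypothetical. The warning is added for illustration; the runtime clamps silently.

```python
import warnings

AXEX_MANIFOLD_K_MAX = 1536  # value quoted in this section


def effective_rank(requested_k: int, k_max: int = AXEX_MANIFOLD_K_MAX) -> int:
    """Rank actually used after the cap is applied."""
    if requested_k > k_max:
        # Illustrative only: surfaces the clamp that the runtime applies silently.
        warnings.warn(f"rank {requested_k} clamped to {k_max}")
        return k_max
    return requested_k


print(effective_rank(2048))  # 1536
print(effective_rank(1024))  # 1024
```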

O_proj excluded

The output projection is left full-rank. Early experiments showed quality instability when compressing O_proj at 8B scale. Root cause has not been deeply investigated.

CUDA-only runtime

No ROCm, Metal, or CPU-only support. Reproduction requires an NVIDIA GPU with ≥8 GB VRAM and a compatible CUDA driver.

12.3 Methodological Gaps (What This Paper Does Not Establish)

Beyond the technical constraints above, the following methodological gaps are documented so that reviewers can calibrate the strength of the claims:

Gap	What is missing	Why it matters
Direct L2 cache-hit measurement	The cache-fit hypothesis (§7) is supported by access-pattern analysis and matches the predicted $k/d \approx 0.25$ optimum, but no hardware counter trace (e.g., the Nsight Compute L2 hit-rate metric `lts__t_sector_hit_rate.pct`, the successor to nvprof's `l2_tex_hit_rate`) is included. The cache-fit explanation is consistent with, but not directly verified by, hardware events.	Without counter data, alternative micro-architectural explanations (e.g., register-pressure relief, scheduler effects) cannot be ruled out.
Task-level evaluations	Quality is measured only by WikiText-2 perplexity. No MMLU, GSM8K, HumanEval, or instruction-following benchmark is reported.	+13.30% PPL is a structural-level signal, not a behavioural one. Generation quality at $k=1536$ on real downstream tasks is unmeasured.
Head-to-head with AWQ / GPTQ / SmoothQuant	Direct A/B throughput and quality comparisons against AWQ w4-g128, GPTQ 3-bit / 4-bit, and SmoothQuant on identical hardware are not included. We compare only against the same Q4_K_M baseline that GRC sits on top of.	The "calibration-free" claim is real (none of the methods listed here skips calibration), but the "useful at production scale" claim cannot be fully ranked without compatible-runtime baselines.
Cross-hardware validation	All measurements are on RTX 4070 Laptop (32 MB L2, 256 GB/s GDDR6). The cache-fit predictions for RTX 4090, A100, H100 in Table 7.3 are calculated, not measured.	Without cross-hardware data, the cache-fit principle cannot be claimed as general, only as observed on this specific GPU.

Items 1, 3, and 4 require infrastructure (Nsight Compute access, multi-GPU benchmark cluster, AWQ/GPTQ runtime ports) outside the scope of an independent high-school project. Item 2 (task evaluations) is a near-term work item already on the roadmap.

§13, Future Work

Future Work

13.1 Cross-Hardware Cache-Fit Sweep

The highest-priority open question is whether the super-baseline effect at $k=1024$ is reproducible on other GPU microarchitectures. The predictions in Table 7.3 are derivable from cache size and bandwidth ratios, but must be empirically validated. A systematic sweep of $k$ values on RTX 4090, A100, and H100 would confirm or refute the cache-fit hypothesis and allow fitting a predictive model for hardware-optimal rank.

13.2 FFN Compression

FFN weights (gate, up, down projections; 14,336 × 4,096 for Llama-3.1-8B) have substantially flatter singular value spectra than attention weights: low-rank approximation at $k/n = 3.5\%$ explains only 3.5% of the Frobenius norm, making global SVD truncation unacceptably lossy.

Viable paths include: (a) block-diagonal decomposition: decompose each FFN weight into $B$ blocks and compress each independently, finding local low-rank structure; (b) input-adaptive sparse activation: identify and skip near-zero neurons per token (exploiting the superposition / monosemanticity structure); (c) FFN on CPU + attention on GPU: keep FFN in system RAM and run it on CPU while the GPU handles attention-only GRC, accepting PCIe latency as a throughput tradeoff.

13.3 Per-Matrix Basis (Separate Q vs KV Subspaces)

The current implementation uses a shared $\mathbf{P}_t^{(\ell)}$ for Q, K, V in each layer. Because Q and KV matrices often operate in different subspaces (particularly in GQA architectures like Llama-3), per-matrix bases could significantly improve quality at the same rank, especially for Q (which showed 79--87% energy capture vs 95--97% for K/V at $k=2048$).

13.4 Rank-Adaptive Deployment

The cache-fit effect suggests a deployment strategy: at model serve time, project weights to the hardware's cache-fit rank rather than the training rank. This is a one-time offline step with deterministic output. Different hardware profiles would be served different projection ranks from the same base model. The W_proj cache infrastructure in GRC already supports this by keying caches on (model hash, rank).
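A cache keyed on (model hash, rank) can be sketched as below. The actual hash function and file naming used by the GRC W_proj cache are not described in this report, so SHA-256, the `wproj_..._k<rank>.bin` naming and both function names are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def model_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the model file, streamed so multi-GB files do not fill RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_name(model_digest: str, rank: int) -> str:
    """Hypothetical cache file name keyed on (model hash, rank)."""
    return f"wproj_{model_digest[:16]}_k{rank}.bin"


# Toy usage: one model file, two hardware profiles with different cache-fit ranks.
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "toy.gguf"
    model.write_bytes(b"not a real model")
    digest = model_hash(model)
    print(cache_name(digest, 1024))
    print(cache_name(digest, 1536))
```

Keying on both values means the same base model can be served at different ranks on different hardware without one projection overwriting another.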

13.5 Quality Recovery via Distillation

The +13.30% PPL penalty is structural given the current calibration-free basis. A subsequent few-shot distillation step, using the uncompressed model as teacher, could recover quality without full retraining, following the LoRA/QLoRA paradigm. The W_proj matrices are differentiable and could be fine-tuned directly.

§14, Reproducibility

Reproducing This Work

14.1 Requirements

Requirement	Detail
GPU	NVIDIA GPU, ≥8 GB VRAM, CUDA driver ≥520
Model	`bartowski/Meta-Llama-3.1-8B-Instruct-GGUF` (Q4_K_M, 4.58 GB)
Runtime	Geodessical binary or source build (Zig CC required for Windows)
Disk	~5.8 GB (model + W_proj cache)
First-run time	60--120 s CPU basis construction (W_proj build); subsequent runs use disk cache

14.2 Key Commands

# Baseline throughput (PowerShell; a trailing backtick continues the line)
.\build_host\geodessical.exe <model.gguf> -n 256 --temp 0 `
    -p "Write a sorting algorithm in Python"

# GRC k=1536 inference
.\build_host\geodessical.exe <model.gguf> -n 256 --temp 0 `
    -p "Write a sorting algorithm in Python" `
    --axex-compress --axex-attn-only --axex-skip-o `
    --axex-weight-pca --axex-compress-rank 1536

# Baseline perplexity
.\build_host\geodessical.exe <model.gguf> --ppl-eval

# GRC perplexity (rank 2048 requested; clamped to k=1536 effective)
.\build_host\geodessical.exe <model.gguf> --ppl-eval `
    --axex-compress --axex-attn-only --axex-skip-o `
    --axex-weight-pca --axex-compress-rank 2048

# Full benchmark harness (rank sweep + CI + PPL, ~60 min)
.\scripts\benchmark_whitepaper_finalize.ps1 -CooldownSec 30

# Gate validator
.\scripts\validation_cycle.ps1 `
    -PackDir benchmarks\whitepaper_pack_20260427_121815

14.3 Expected Outputs

Reference values from validated pack whitepaper_pack_20260427_121815:

k=1024  decode: 106.27%  overall: 105.72%  prefill: 102.67%
k=1536  decode:  97.55%  overall:  95.80%  prefill: 114.61%
k=2048† decode: 101.04%  overall:  99.34%  prefill: 108.48%

coding/256    lower-95 decode retention: 86.60%
reasoning/256 lower-95 decode retention: 85.64%

PPL baseline: 6.7902  |  PPL GRC k=1024: 10.9585 (+61.39%)
PPL GRC k=1536: 7.6936 (+13.30%)  |  PPL GRC k=2048: 7.6936 (+13.30%, identical to k=1536)

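† $k=2048$ requests are clamped to $k=1536$ by `AXEX_MANIFOLD_K_MAX` (§12.2), so that row reuses the $k=1536$ projection.

A complete reproduction package is at repro/REPRODUCE.md with expected output CSVs in repro/expected_outputs/. Throughput tolerance: ±5% (GPU clock variance); PPL is deterministic to 4 decimal places.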
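A reproduction run can be checked against these reference values with a few lines of Python. The sketch below reads the ±5% throughput tolerance as relative to the reference value, which is our assumption. The measured numbers in the usage lines are made up for illustration, and the function names are hypothetical (they are not part of repro/REPRODUCE.md).

```python
# Reference values quoted in this section.
EXPECTED_DECODE_PCT = {1024: 106.27, 1536: 97.55}
EXPECTED_PPL = {"baseline": 6.7902, 1024: 10.9585, 1536: 7.6936}


def throughput_ok(measured_pct: float, expected_pct: float, rel_tol: float = 0.05) -> bool:
    """True if the measured value lies within +/-5% of the reference (relative tolerance)."""
    return abs(measured_pct - expected_pct) <= rel_tol * expected_pct


def ppl_ok(measured: float, expected: float) -> bool:
    """True if the two perplexities agree to 4 decimal places."""
    return abs(measured - expected) < 5e-5


# Hypothetical measurements from a reproduction run.
print(throughput_ok(103.9, EXPECTED_DECODE_PCT[1024]))  # True: off by 2.37, allowed 5.31
print(ppl_ok(7.69361, EXPECTED_PPL[1536]))              # True
```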

§15, References

References

Grattafiori, A., et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
Ashkboos, S., Croci, M. L., do Nascimento, M. G., Hoefler, T., & Hensman, J. (2024). SliceGPT: Compress Large Language Models by Deleting Rows and Columns. ICLR 2024. arXiv:2401.15024.
Yuan, Z., Shang, Y., Song, Y., Wu, Q., Yan, Y., & Sun, G. (2023). ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models. arXiv:2312.05821.
Hsu, Y.-C., Hua, T., Chang, S., Lou, Q., Shen, Y., & Jin, H. (2022). Language model compression with weighted low-rank factorization (FWSVD). ICLR 2022. arXiv:2207.00112.
Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323.
Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978.
Hu, E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
Frantar, E., & Alistarh, D. (2023). SparseGPT: Massive Language Models Can be Accurately Pruned in One Shot. arXiv:2301.00774.
Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4), 65--76.
Yuan, Z., Shang, Y., Zhou, Y., Dong, Z., Xue, C., Wu, B., Li, Z., Gu, Q., Lee, Y. J., Yan, Y., Chen, B., Sun, G., & Keutzer, K. (2024). LLM Inference Unveiled: Survey and Roofline Model Insights. arXiv:2402.16363.
NVIDIA Corporation (2022). NVIDIA Ada GPU Architecture Whitepaper. nvidia.com / Ada Lovelace architecture documentation.
Gerganov, G., et al. (2023). llama.cpp k-quants (Q4_K_M, Q5_K_M, Q6_K) format specification. github.com/ggerganov/llama.cpp, PR #1684.
Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP 2021. arXiv:2012.14913.
Kobayashi, S., Akram, Y., & von Oswald, J. (2024). Weight Decay Induces Low-Rank Attention Layers. NeurIPS 2024. arXiv:2410.23819.
Gerganov, G., et al. (2023). GGUF binary format specification. github.com/ggerganov/ggml/blob/master/docs/gguf.md.
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM. Chapter on power iteration and the SVD/Gram-matrix relationship.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. arXiv:2205.14135.
