Status
Paper E is a scaffold companion to Paper A. The Phase 1 reference
implementation (calibration-free GRC projection in numpy) is functional
and ships in scripts/grc_distill.py. Phase 2 (LoRA-residual
fit on Q/K/V against teacher logit MSE) requires a PyTorch + GPU runtime
and does not yet run end to end. Phase 3 (re-quantise and ship a single GGUF)
delegates to llama.cpp.
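As a rough illustration of what the Phase 2 residual fit is meant to do, the toy below fits a rank-r correction `B @ A` on top of a compressed weight so that the student's outputs match a teacher's under MSE. Everything in it is an assumption made for illustration: a truncated SVD stands in for the GRC projection (which this README does not define), a single linear map stands in for Q/K/V, random data stands in for the calibration corpus, and plain gradient descent stands in for whatever optimiser scripts/grc_distill.py uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n = 32, 16, 4, 256          # toy sizes, not the paper's settings

# Teacher weight and a rank-k stand-in for the compressed student weight.
W_teacher = rng.standard_normal((d, d)) / np.sqrt(d)
U, s, Vt = np.linalg.svd(W_teacher)
W_student = (U[:, :k] * s[:k]) @ Vt[:k]   # truncated SVD, NOT the GRC projection

X = rng.standard_normal((n, d))      # stand-in calibration activations
Y_teacher = X @ W_teacher.T          # stand-in for teacher logits

# LoRA-style residual: B starts at zero so the initial student is unchanged.
A = rng.standard_normal((r, d)) / np.sqrt(d)
B = np.zeros((d, r))

def mse(A, B):
    """Mean squared error between student and teacher outputs."""
    Y = X @ (W_student + B @ A).T
    return float(np.mean((Y - Y_teacher) ** 2))

loss_before = mse(A, B)
lr = 1.0
for _ in range(300):
    E = X @ (W_student + B @ A).T - Y_teacher   # (n, d) output error
    G = (2.0 / E.size) * (E.T @ X)              # dLoss/d(effective weight)
    grad_B, grad_A = G @ A.T, B.T @ G           # chain rule through B @ A
    B -= lr * grad_B
    A -= lr * grad_A
loss_after = mse(A, B)

print(f"MSE before residual fit: {loss_before:.5f}")
print(f"MSE after  residual fit: {loss_after:.5f}")
```

Only the small factors `A` and `B` are trained; the compressed weight stays frozen, which is the usual reason for choosing a LoRA-style residual.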
Hardware target
- Phase 1: any x86_64 host with Python 3.11+ and 16 GB RAM. CPU only. Runs in 60-120 s for an 8B model.
- Phase 2: 1x A100/L40S/H100 with at least 24 GB VRAM and PyTorch 2.4. Estimated wall clock for 500 steps at LoRA r=8: 30-60 min on A100.
- Phase 3: same host as Phase 2, plus a working llama.cpp build with the quantize tool.
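The Phase 1 requirements above can be checked before running anything. The sketch below is a hypothetical helper, not part of the repository. It tests only the Python version and CPU architecture; the 16 GB RAM requirement is not checked, because the standard library has no portable way to query it.

```python
import platform
import sys

MIN_PYTHON = (3, 11)                   # Phase 1 requirement listed above
X86_64_NAMES = {"x86_64", "amd64"}     # Linux/macOS vs Windows spelling

def phase1_problems(version=None, machine=None):
    """Return a list of human-readable problems; empty means the host looks OK.

    Both arguments default to the running interpreter and host, and exist
    so the check can be exercised with explicit values.
    """
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    machine = platform.machine() if machine is None else machine
    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} found, 3.11+ required")
    if machine.lower() not in X86_64_NAMES:
        problems.append(f"machine is {machine!r}, expected x86_64")
    return problems
```

Calling `phase1_problems()` with no arguments inspects the current host; printing each returned string is enough for a quick sanity check.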
Quick start (Phase 1, CPU-only)
# 1. From the repo root
python -m venv .venv
# Windows shown; on Linux/macOS use: source .venv/bin/activate
.venv\Scripts\activate
pip install numpy
# 2. Run the calibration-free reference (k=1024, no sink-channel)
python scripts/grc_distill.py --model models/Llama-3.1-8B-Instruct-Q4_K_M.gguf --rank 1024 --out distill_out/
# 3. Inspect the manifest
# Windows shown; on Linux/macOS use: cat distill_out/distill_manifest.json
type distill_out\distill_manifest.json
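The manifest from step 3 can also be read programmatically. This is a minimal sketch that assumes the stage marker is a top-level string field named `stage`. That field name is a guess, not something this README confirms, so adjust `key` to match the real file. The three accepted values are the stage markers this README lists.

```python
import json
from pathlib import Path

# Stage markers named in this README.
KNOWN_STAGES = {"scaffold", "calibration_free", "distill_requested"}

def read_stage(manifest_path, key="stage"):
    """Load the manifest and return its stage marker.

    ASSUMPTION: the marker is a top-level string stored under `key`.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    stage = manifest.get(key)
    if stage not in KNOWN_STAGES:
        raise ValueError(f"unexpected stage marker {stage!r} under key {key!r}")
    return stage
```

For example, `read_stage("distill_out/distill_manifest.json")` returns the marker or raises `ValueError` if the field is missing or unrecognised.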
Phase 2 (GPU, calibration-permitted)
# Requires PyTorch + transformers; runs on EC2 A100/L40S
python scripts/grc_distill.py \
--model models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--rank 1024 \
--distill \
--corpus calibration/wikitext_calib_2k.txt \
--lora-rank 8 \
--steps 500 \
--device cuda \
--out distill_out/
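When the run is launched from Python rather than a shell, building the argument list explicitly avoids shell-specific line continuation. The helper below is a hypothetical convenience wrapper, not part of the repository. It uses only the flags shown in this README, and its defaults mirror the values in the command above.

```python
import sys

def build_distill_cmd(model, out_dir, rank=1024, corpus=None,
                      lora_rank=8, steps=500, device="cuda"):
    """Build the argv list for scripts/grc_distill.py.

    Passing `corpus` switches on the Phase 2 flags; leaving it as None
    yields the Phase 1 (calibration-free) command.
    """
    cmd = [sys.executable, "scripts/grc_distill.py",
           "--model", str(model), "--rank", str(rank)]
    if corpus is not None:
        cmd += ["--distill", "--corpus", str(corpus),
                "--lora-rank", str(lora_rank),
                "--steps", str(steps), "--device", device]
    cmd += ["--out", str(out_dir)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repo root.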
Expected outputs
- distill_out/distill_manifest.json: configuration record + stage marker (scaffold, calibration_free, or distill_requested).
- Pending (Phase 2): per-layer LoRA residual weights and a final perplexity delta against the un-distilled GRC k=1024 baseline. Expected: closure of approximately half of the 61.4% PPL gap, hitting roughly +30% over the uncompressed baseline at unchanged decode throughput.
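The "roughly +30%" figure follows directly from halving the quoted gap. The sketch below only restates that arithmetic; the helper name is hypothetical, and the closed fraction of 0.5 is the README's expectation, not a measured result.

```python
def remaining_gap(gap_pct, closed_fraction):
    """Relative PPL gap (in %) left after closing `closed_fraction` of it."""
    return gap_pct * (1.0 - closed_fraction)

gap = 61.4                                   # GRC k=1024 PPL gap quoted above, in percent
print(round(remaining_gap(gap, 0.5), 1))     # → 30.7
```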
Caveats
- The empirical Phase-2 results in Paper E are explicitly listed as Pending until the EC2 A100 runner is functional.
- The Wikitext-2 calibration corpus citation in the paper triggers a pandoc warning; this is a known harmless reference issue and does not affect the rendered PDF or HTML.