Scope
This guide reproduces the per-task perplexity (PPL) impact measurements across 8 benchmark tasks (LAMBADA, HellaSwag, PIQA, ARC-E, ARC-C, WinoGrande, MMLU, GSM8K) for GRC-compressed models at k=256, 512, 768, and 1024. The goal is to validate that knowledge-intensive tasks degrade faster than reasoning tasks, which would confirm the zone-specialisation hypothesis from Paper XI (UGT).
Hardware target
- Reference GPU: RTX 4070 Laptop 8GB or L40S 46GB.
- Minimum: any CUDA 12.x GPU with ≥8GB VRAM.
- Host: 16GB RAM, 10GB free disk for lm_eval harness.
Prerequisites
- CUDA driver 552+, PowerShell 7+.
- lm_eval installed (pip install lm_eval).
- GGUF model at desired quantisation (Q4_K_M recommended).
- Geodessical binary built (build_host.ps1).
Step 1: Run the harness
cd scripts
python run_task_impact.py --model ../models/qwen2.5-7b-q4_k_m.gguf \
--k-values 256,512,768,1024 --tasks lambada,hellaswag,piqa,arc_e,arc_c,winogrande,mmlu,gsm8k \
--output ../benchmarks/task_impact.csv
Step 2: Expected output
The CSV contains per-task PPL and accuracy for each k. Key expectations:
- LAMBADA PPL increases ~2-5% at k=512, ~8-15% at k=256.
- Reasoning tasks (ARC, GSM8K) degrade by <3% at k=512.
- Knowledge tasks (MMLU) degrade by ~8-12% at k=512, which confirms the zone pattern.
- All tasks recover to within 1% of baseline at k=1024.
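The expectations above can be checked numerically rather than by eye. The following is a minimal sketch, not the harness's own tooling. It assumes the CSV has columns named task, k, and ppl, and that the uncompressed baseline is stored as a row with k=0. Both are assumptions made for illustration; adjust them to the real header of task_impact.csv. The function names are hypothetical.

```python
import csv
import io


def load_ppl(csv_text):
    """Parse CSV text into {task: {k: ppl}}.

    Assumed columns (hypothetical): task, k, ppl.
    """
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table.setdefault(row["task"], {})[int(row["k"])] = float(row["ppl"])
    return table


def pct_increase(baseline, value):
    """Percentage increase of value over baseline."""
    return 100.0 * (value - baseline) / baseline


def degradation(table, baseline_k=0):
    """Return {task: {k: % PPL increase over the baseline row}}.

    baseline_k=0 is an assumed convention for the uncompressed run.
    """
    out = {}
    for task, by_k in table.items():
        base = by_k[baseline_k]
        out[task] = {
            k: pct_increase(base, ppl)
            for k, ppl in by_k.items()
            if k != baseline_k
        }
    return out


def tasks_outside_tolerance(deg, k, tol_pct):
    """List tasks whose PPL increase at k exceeds tol_pct percent."""
    return sorted(
        task for task, by_k in deg.items()
        if k in by_k and by_k[k] > tol_pct
    )
```

With the real file loaded as text, tasks_outside_tolerance(degradation(load_ppl(text)), 1024, 1.0) would list any task that has not recovered to within 1% of baseline at k=1024; an empty list matches the last expectation.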
Validation
From the repository root, run the benchmark_graph.py script to generate the task-vs-k plot, then verify that the zone-specialisation curve matches the paper's Figure 3.
python scripts/benchmark_graph.py --input benchmarks/task_impact.csv \
--output docs/figures/task_impact.png
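As a numeric complement to the visual comparison, the gap between knowledge and reasoning degradation at a given k can be computed directly. This is a rough sketch under stated assumptions: the task grouping uses only the tasks this guide names explicitly (ARC and GSM8K as reasoning, MMLU as knowledge), the task keys follow the --tasks flag in Step 1, and the input is a hypothetical dict of per-task percentage degradation that you would derive from the CSV yourself.

```python
from statistics import mean

# Grouping taken from the expectations in this guide; the remaining
# tasks are left out because the guide does not classify them.
REASONING = ("arc_e", "arc_c", "gsm8k")
KNOWLEDGE = ("mmlu",)


def group_gap(deg_at_k):
    """Mean knowledge degradation minus mean reasoning degradation.

    deg_at_k: {task: % degradation} for a single k (hypothetical input shape).
    A positive result means knowledge tasks degraded more at that k.
    """
    knowledge = mean(deg_at_k[t] for t in KNOWLEDGE)
    reasoning = mean(deg_at_k[t] for t in REASONING)
    return knowledge - reasoning
```

A positive gap at k=256 and k=512 that shrinks toward zero at k=1024 would be consistent with the pattern described in the expected output.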