Reproduce Paper F: Task Impact Analysis, HyperTensor Research

Scope

This guide reproduces the per-task PPL impact measurements across 8 benchmark tasks (LAMBADA, HellaSwag, PIQA, ARC-E, ARC-C, WinoGrande, MMLU, GSM8K) for GRC-compressed models at k=256, 512, 768, 1024. It checks that knowledge-intensive tasks degrade faster than reasoning tasks, as predicted by the zone-specialisation hypothesis from Paper XI (UGT).

Hardware target

Reference GPU: RTX 4070 Laptop 8GB or L40S 46GB.
Minimum: any CUDA 12.x GPU with ≥8GB VRAM.
Host: 16GB RAM, 10GB free disk for lm_eval harness.

Prerequisites

CUDA driver 552+, PowerShell 7+.
lm_eval installed (pip install lm_eval).
GGUF model at desired quantisation (Q4_K_M recommended).
Geodessical binary built (build_host.ps1).

Step 1: Run the task-impact harness

cd scripts
python run_task_impact.py --model ../models/qwen2.5-7b-q4_k_m.gguf \
    --k-values 256,512,768,1024 --tasks lambada,hellaswag,piqa,arc_e,arc_c,winogrande,mmlu,gsm8k \
    --output ../benchmarks/task_impact.csv
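
Before reading the numbers, it can help to confirm that the sweep covered every task at every k (8 tasks x 4 k-values = 32 cells). The sketch below is a minimal completeness check, not part of the repository. It assumes the CSV has columns named task and k, which is a guess at the schema that run_task_impact.py writes; adjust the names to match the real header.

```python
import csv
import io
from itertools import product

# Task names and k-values copied from the command line above.
TASKS = ["lambada", "hellaswag", "piqa", "arc_e", "arc_c",
         "winogrande", "mmlu", "gsm8k"]
K_VALUES = [256, 512, 768, 1024]


def missing_cells(csv_text, tasks=TASKS, k_values=K_VALUES):
    """Return the (task, k) pairs the sweep should cover but the CSV lacks.

    Assumes hypothetical column names 'task' and 'k'.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {(row["task"], int(row["k"])) for row in reader}
    return [cell for cell in product(tasks, k_values) if cell not in seen]
```

To use it on the real output, read ../benchmarks/task_impact.csv into a string and pass it in; an empty list means every cell is present.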

Step 2: Expected output

The CSV contains per-task PPL and accuracy for each k. Key expectations:

LAMBADA PPL increases ~2-5% at k=512, ~8-15% at k=256.
Reasoning tasks (ARC, GSM8K) degrade <3% at k=512.
Knowledge tasks (MMLU) degrade ~8-12% at k=512, which confirms the zone pattern.
All tasks recover to within 1% of baseline at k=1024.
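
The expectations above are relative changes against the uncompressed baseline. A minimal sketch of that arithmetic follows. The helper names are hypothetical and not taken from the repository, and the baseline value is assumed to come from a separate uncompressed run, since this guide does not say whether the CSV includes one.

```python
def pct_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (value - baseline) / baseline


def in_band(baseline, value, low_pct, high_pct):
    """True if the relative change falls inside [low_pct, high_pct]."""
    return low_pct <= pct_change(baseline, value) <= high_pct


def recovered(baseline, value, tolerance_pct=1.0):
    """True if value is within tolerance_pct of baseline (the k=1024 check)."""
    return abs(pct_change(baseline, value)) <= tolerance_pct
```

As an illustration with made-up numbers (not measured results): a baseline LAMBADA PPL of 8.0 and a k=512 PPL of 8.25 is a 3.125% increase, so in_band(8.0, 8.25, 2.0, 5.0) holds for the ~2-5% band.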

Validation

From the repository root (Step 1 leaves you in scripts/), run benchmark_graph.py to generate the task-vs-k plot, then verify that the zone-specialisation curve matches Figure 3 of the paper.

python scripts/benchmark_graph.py --input benchmarks/task_impact.csv \
    --output docs/figures/task_impact.png
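
Alongside the visual comparison, the zone pattern can be given a rough numeric check: at a fixed k, the mean degradation of knowledge tasks should exceed that of reasoning tasks. The sketch below assumes the grouping used in the expectations above (MMLU as knowledge; ARC and GSM8K as reasoning). The function name and the input mapping of task name to percent degradation are illustrative assumptions, not the paper's method.

```python
from statistics import mean

# Grouping taken from the expectations in Step 2; other tasks are left ungrouped.
KNOWLEDGE_TASKS = {"mmlu"}
REASONING_TASKS = {"arc_e", "arc_c", "gsm8k"}


def zone_gap(degradation_pct):
    """Mean knowledge-task degradation minus mean reasoning-task degradation.

    degradation_pct maps task name -> percent degradation at one fixed k.
    A positive result is consistent with the zone-specialisation pattern.
    """
    knowledge = [v for t, v in degradation_pct.items() if t in KNOWLEDGE_TASKS]
    reasoning = [v for t, v in degradation_pct.items() if t in REASONING_TASKS]
    if not knowledge or not reasoning:
        raise ValueError("need at least one task from each group")
    return mean(knowledge) - mean(reasoning)
```

A positive gap at k=512 that shrinks toward zero at k=1024 would agree with the expectations listed in Step 2; it does not replace the comparison against Figure 3.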