Reproduce F · Per-Task Impact

Reproduce Paper F: Per-Task Impact of GRC Compression

William Ken Ohara Stewart (NagusameCS Independent Research)

HyperTensor Project · May 2026 · Paper F (HTML) · repro tree

Scope

This guide reproduces the per-task perplexity (PPL) impact measurements across eight benchmark tasks (LAMBADA, HellaSwag, PIQA, ARC-E, ARC-C, WinoGrande, MMLU, GSM8K) for GRC-compressed models at k = 256, 512, 768, and 1024. It validates that knowledge-intensive tasks degrade faster than reasoning tasks, the behaviour predicted by the zone-specialisation hypothesis from Paper XI (UGT).

Hardware target

Prerequisites

Step 1: Run the harness

cd scripts
python run_task_impact.py --model ../models/qwen2.5-7b-q4_k_m.gguf \
    --k-values 256,512,768,1024 --tasks lambada,hellaswag,piqa,arc_e,arc_c,winogrande,mmlu,gsm8k \
    --output ../benchmarks/task_impact.csv
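
To make the size of the sweep explicit, the following is a minimal sketch of how the two comma-separated flags expand into individual (k, task) configurations. It does not reproduce the argument handling inside run_task_impact.py, which this guide does not show; the function name expand_grid is hypothetical.

```python
from itertools import product

# Flag values copied from the command above.
K_VALUES = "256,512,768,1024"
TASKS = "lambada,hellaswag,piqa,arc_e,arc_c,winogrande,mmlu,gsm8k"


def expand_grid(k_values: str, tasks: str) -> list[tuple[int, str]]:
    """Expand comma-separated flag values into (k, task) pairs.

    Hypothetical helper: the real harness may order or batch runs differently.
    """
    ks = [int(k) for k in k_values.split(",") if k.strip()]
    names = [t.strip() for t in tasks.split(",") if t.strip()]
    return list(product(ks, names))


grid = expand_grid(K_VALUES, TASKS)
print(len(grid))  # 4 k-values x 8 tasks = 32 configurations
```

If the output CSV holds one row per configuration, a complete run would cover all 32 pairs.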

Step 2: Expected output

The CSV contains per-task PPL and accuracy for each k. Key expectations:
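
Before comparing numbers, it can help to confirm that the CSV covers every task at every k. The sketch below loads the file into a nested dictionary and lists any missing cells. The column names task, k, ppl, and accuracy are assumptions; this guide does not state the actual header, so adjust them to match the file.

```python
import csv
from collections import defaultdict


def load_task_impact(path):
    """Return {task: {k: {"ppl": float, "accuracy": float}}} from the CSV.

    Assumed (unconfirmed) column names: "task", "k", "ppl", "accuracy".
    """
    table = defaultdict(dict)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            table[row["task"]][int(row["k"])] = {
                "ppl": float(row["ppl"]),
                "accuracy": float(row["accuracy"]),
            }
    return dict(table)


def missing_cells(table, tasks, k_values):
    """List the (task, k) pairs that the loaded table does not cover."""
    return [(t, k) for t in tasks for k in k_values if k not in table.get(t, {})]
```

A typical use is to call missing_cells(load_task_impact("benchmarks/task_impact.csv"), tasks, k_values) with the task and k lists from Step 1; an empty list means the sweep completed.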

Validation

From the repository root (Step 1 leaves you in scripts/, so run cd .. first), run the benchmark_graph.py script to generate the task-vs-k plot, then verify that the zone-specialisation curve matches Figure 3 of the paper.

python scripts/benchmark_graph.py --input benchmarks/task_impact.csv \
    --output docs/figures/task_impact.png
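
As a numeric complement to the visual check, the sketch below compares the mean relative PPL increase of two task groups at a given k. It is an illustration under stated assumptions, not the paper's validation procedure: the function names are hypothetical, the groups are supplied by the caller because this guide does not list which tasks count as knowledge-intensive or reasoning, and reference_k is whichever k you treat as the baseline.

```python
from statistics import mean


def relative_increase(ppl_by_k, reference_k):
    """Map each k to PPL(k) / PPL(reference_k) - 1 for a single task."""
    ref = ppl_by_k[reference_k]
    return {k: ppl / ref - 1.0 for k, ppl in ppl_by_k.items()}


def group_degradation(ppl_table, group, k, reference_k):
    """Mean relative PPL increase at k over the tasks named in group.

    ppl_table has the shape {task: {k: ppl}}.
    """
    return mean(relative_increase(ppl_table[t], reference_k)[k] for t in group)


def degrades_faster(ppl_table, group_a, group_b, k, reference_k):
    """True if group_a shows a larger mean relative PPL increase than group_b."""
    a = group_degradation(ppl_table, group_a, k, reference_k)
    b = group_degradation(ppl_table, group_b, k, reference_k)
    return a > b
```

Using a ratio rather than an absolute PPL difference keeps tasks with very different baseline perplexities comparable.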