Minimal reproduction
Strip the prompt down to the smallest case that still fails. Keep cutting while the failure persists, until about 20-30 lines remain. This isolates the issue.
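A minimal sketch of the stripping loop, assuming the prompt can be split into lines and that you supply a `fails(prompt)` check (hypothetical; in practice it would call your model and inspect the output). It greedily drops one line at a time and keeps the cut only if the failure survives. `demo_fails` is a stand-in so the sketch runs without a model.

```python
from typing import Callable, List


def minimize_prompt(lines: List[str], fails: Callable[[str], bool]) -> List[str]:
    """Greedily drop lines for as long as the shortened prompt still fails."""
    current = list(lines)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            # Keep the cut only if the failure is still reproducible.
            if candidate and fails("\n".join(candidate)):
                current = candidate
                changed = True
                break
    return current


# Hypothetical stand-in for a real model call: the "bug" appears only
# when two conflicting instructions are both present.
def demo_fails(prompt: str) -> bool:
    return "always answer in JSON" in prompt and "reply in plain prose" in prompt


demo_lines = [
    "You are a helpful assistant.",
    "always answer in JSON",
    "Be concise.",
    "reply in plain prose",
    "Cite sources.",
]
minimal = minimize_prompt(demo_lines, demo_fails)
print(minimal)
```

With a real model, `fails` is slow and noisy, so it may be worth repeating each check a few times before accepting a cut.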
Ablation
Remove one section at a time and retest to find which section causes the failure. Alternatively, add sections one by one and watch for the failure to appear.
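Both directions of ablation can be sketched as follows, assuming the prompt is stored as named sections and `fails(prompt)` is a check you provide (both hypothetical). `culprits_by_removal` drops one section at a time; `first_failing_addition` builds the prompt up section by section.

```python
from typing import Callable, Dict, List, Optional


def culprits_by_removal(sections: Dict[str, str],
                        fails: Callable[[str], bool]) -> List[str]:
    """Names of sections whose removal makes the failure go away."""
    culprits = []
    for name in sections:
        remaining = [text for key, text in sections.items() if key != name]
        if not fails("\n\n".join(remaining)):
            culprits.append(name)
    return culprits


def first_failing_addition(sections: Dict[str, str],
                           fails: Callable[[str], bool]) -> Optional[str]:
    """Add sections in order; return the one at which the failure first appears."""
    built: List[str] = []
    for name, text in sections.items():
        built.append(text)
        if fails("\n\n".join(built)):
            return name
    return None


# Hypothetical sections and failure check, for illustration only.
demo_sections = {
    "role": "You are a support agent.",
    "format": "Respond ONLY with YAML.",
    "examples": "Q: Where is my order? A: Let me check.",
}


def demo_fails(prompt: str) -> bool:
    return "YAML" in prompt


print(culprits_by_removal(demo_sections, demo_fails))
print(first_failing_addition(demo_sections, demo_fails))
```

If removal finds no single culprit, the failure may depend on an interaction between sections, and the additive direction is usually more informative.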
Model comparison
Run the same prompt on 3 models. Fails only on X → model quirk. Fails on all → prompt bug. Passes on all → the original failure is intermittent (temperature/sampling).
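The decision rule above can be written down as a small function. This is a sketch: the model names are placeholders, and the input is assumed to be a mapping from model name to whether the prompt failed there. Treating a failure on some but not all models as model-specific is an assumption for the case where more than one model fails.

```python
from typing import Dict


def diagnose(failed_by_model: Dict[str, bool]) -> str:
    """Map per-model pass/fail results to a likely cause."""
    failing = [name for name, failed in failed_by_model.items() if failed]
    if not failing:
        return "intermittent: suspect temperature/sampling"
    if len(failing) == len(failed_by_model):
        return "prompt bug"
    return "model quirk: " + ", ".join(failing)


# Placeholder model names, for illustration only.
print(diagnose({"model-a": True, "model-b": False, "model-c": False}))
print(diagnose({"model-a": True, "model-b": True, "model-c": True}))
print(diagnose({"model-a": False, "model-b": False, "model-c": False}))
```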
Temperature = 0
Set temperature to 0 for near-deterministic runs (some APIs still show small run-to-run variation). This rules out sampling noise. Then vary temperature systematically.
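One way to vary temperature systematically is to measure the failure rate at each setting over repeated runs. The sketch below assumes a `run_fails(temperature)` callable that runs the prompt once and reports whether it failed (hypothetical). `fake_run_fails` is a seeded stand-in that never fails at temperature 0, so the example runs without a model.

```python
import random
from typing import Callable, Dict, Sequence


def failure_rate_by_temperature(run_fails: Callable[[float], bool],
                                temperatures: Sequence[float],
                                trials: int = 20) -> Dict[float, float]:
    """Run the prompt `trials` times per temperature and record the failure rate."""
    rates = {}
    for temperature in temperatures:
        failures = sum(1 for _ in range(trials) if run_fails(temperature))
        rates[temperature] = failures / trials
    return rates


# Hypothetical stand-in for a model call: failure probability grows with
# temperature and is zero at temperature 0.
rng = random.Random(0)


def fake_run_fails(temperature: float) -> bool:
    return rng.random() < 0.5 * temperature


rates = failure_rate_by_temperature(fake_run_fails, [0.0, 0.5, 1.0])
for temperature, rate in rates.items():
    print(f"temperature={temperature}: failure rate {rate:.2f}")
```

A failure rate that stays high at temperature 0 points away from sampling noise and back toward the prompt itself.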