Minimal reproduction

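Strip the prompt down to the smallest case that still fails. Keep cutting, re-checking after each cut that the failure persists, until roughly 20-30 lines remain. This isolates the issue.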
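
The reduction loop can be sketched as below. This is a minimal illustration, not a definitive tool: `minimize_prompt` and `still_fails` are hypothetical names, and `still_fails` is a stand-in for whatever check you use (in practice it would call the model and inspect the output).

```python
from typing import Callable, List


def minimize_prompt(lines: List[str], still_fails: Callable[[str], bool]) -> List[str]:
    """Greedily drop one line at a time, keeping each cut only if the failure persists."""
    current = list(lines)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            # Accept the cut only when the shorter prompt still reproduces the failure.
            if candidate and still_fails("\n".join(candidate)):
                current = candidate
                changed = True
                break
    return current


# Hypothetical stand-in: the "failure" here is two contradictory instructions
# appearing together. A real check would run the model instead.
def still_fails(prompt: str) -> bool:
    return "ALWAYS answer in JSON" in prompt and "Reply in plain text" in prompt


prompt_lines = [
    "You are a helpful assistant.",
    "ALWAYS answer in JSON.",
    "Be concise.",
    "Reply in plain text only.",
    "Cite sources.",
]
minimal = minimize_prompt(prompt_lines, still_fails)
print(minimal)
```

With a real model each `still_fails` call costs a request, so for long prompts it may be cheaper to cut whole sections first and single lines last.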

Ablation

Remove one section at a time and re-test to find which section causes the failure. Alternatively, start from an empty prompt and add sections one by one, watching for the point where the failure appears.
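
Both directions can be sketched in a few lines. The function names, the section names, and the `fails` check are all hypothetical; `fails` stands in for a real model call plus an output check.

```python
from typing import Callable, Dict, List, Optional


def ablate(sections: Dict[str, str], fails: Callable[[str], bool]) -> List[str]:
    """Return the sections whose removal makes the failure disappear."""
    culprits = []
    for name in sections:
        remaining = "\n\n".join(text for key, text in sections.items() if key != name)
        if not fails(remaining):
            culprits.append(name)
    return culprits


def first_failing_addition(sections: Dict[str, str], fails: Callable[[str], bool]) -> Optional[str]:
    """Add sections in order; return the first one whose addition triggers the failure."""
    built: List[str] = []
    for name, text in sections.items():
        built.append(text)
        if fails("\n\n".join(built)):
            return name
    return None


# Hypothetical stand-in check: pretend one instruction is what breaks the output.
def fails(prompt: str) -> bool:
    return "Never use lists." in prompt


sections = {
    "role": "You are a support agent.",
    "format": "Never use lists. Answer in one paragraph.",
    "examples": "Q: How do I reset my password?\nA: Open Settings and choose Reset.",
}
culprits = ablate(sections, fails)
trigger = first_failing_addition(sections, fails)
print(culprits, trigger)
```

If removing any single section fixes the failure, the cause is probably an interaction between sections rather than one section alone.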

Model comparison

Run the same prompt on 3 models. Fails only on model X → model quirk. Fails on all → prompt bug. Passes on all → the original failure is intermittent (temperature/sampling).
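
The decision rule above can be written down as a small function. This is a sketch: `diagnose` and the model names are hypothetical, and the "inconclusive" branch for failures on some but not all models is an assumption, since the rule only covers the three cases listed.

```python
from typing import Dict


def diagnose(results: Dict[str, bool]) -> str:
    """results maps a model name to True if the prompt failed on that model."""
    failed = [model for model, did_fail in results.items() if did_fail]
    if not failed:
        return "intermittent: suspect temperature/sampling"
    if len(failed) == len(results):
        return "prompt bug"
    if len(failed) == 1:
        return "model quirk: " + failed[0]
    # Assumption: the rule above does not cover this case, so flag it for a closer look.
    return "inconclusive: fails on " + ", ".join(failed)


# Hypothetical results; in practice each value comes from running the prompt on that model.
print(diagnose({"model_a": False, "model_b": True, "model_c": False}))
print(diagnose({"model_a": True, "model_b": True, "model_c": True}))
print(diagnose({"model_a": False, "model_b": False, "model_c": False}))
```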

Temperature = 0

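Set temperature to 0 so runs are deterministic (or as close to it as the model allows). This rules out sampling noise. Then raise the temperature in fixed steps and re-run at each setting to see where the failure starts to appear.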
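
A sweep like this can be sketched as repeated runs per temperature with a failure rate recorded for each. Everything here is illustrative: `failure_rate`, `sweep`, and `run_fails` are hypothetical names, and `run_fails` simulates a model with a seeded random generator rather than calling a real API.

```python
import random
from typing import Callable, Dict, List


def failure_rate(run_fails: Callable[[str, float], bool], prompt: str,
                 temperature: float, n: int = 20) -> float:
    """Run the prompt n times at one temperature and return the fraction of failures."""
    failures = sum(1 for _ in range(n) if run_fails(prompt, temperature))
    return failures / n


def sweep(run_fails: Callable[[str, float], bool], prompt: str,
          temperatures: List[float], n: int = 20) -> Dict[float, float]:
    """Measure the failure rate at each temperature, starting from the baseline."""
    return {t: failure_rate(run_fails, prompt, t, n) for t in temperatures}


# Hypothetical stand-in for a model call: never fails at temperature 0 and
# fails with probability temperature / 2 otherwise.
_rng = random.Random(0)


def run_fails(prompt: str, temperature: float) -> bool:
    return _rng.random() < temperature / 2


rates = sweep(run_fails, "example prompt", [0.0, 0.5, 1.0])
for temperature, rate in rates.items():
    print(f"temperature={temperature}: failure rate {rate:.2f}")
```

A rate of exactly 0.0 or 1.0 at temperature 0 suggests the behavior does not depend on sampling; a rate that climbs with temperature points to sampling as the cause.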