Golden set

50 to 500 (input, expected_output) pairs, curated per task. Cover normal and edge cases. Add a case whenever a new failure is found.
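
A minimal sketch of how such a set could be stored and run. The names `GoldenCase`, `tag`, and `run_golden_set` are illustrative assumptions, not an established API, and exact match is used only as the simplest possible check.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GoldenCase:
    input: str
    expected_output: str
    tag: str = "normal"  # hypothetical label, e.g. "normal", "edge", "regression"


def run_golden_set(
    cases: List[GoldenCase], system: Callable[[str], str]
) -> Tuple[float, List[GoldenCase]]:
    """Return (pass rate, failing cases) using whitespace-trimmed exact match."""
    failures = [
        c for c in cases
        if system(c.input).strip() != c.expected_output.strip()
    ]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures


# Tiny illustrative set: one normal case, one edge case.
cases = [
    GoldenCase("2+2", "4"),
    GoldenCase("", "error: empty input", tag="edge"),
]
```

Failing cases come back as objects, so a newly found failure can be added to `cases` with a `"regression"` tag.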

Metrics

Exact match. F1 (token or label overlap). BLEU/ROUGE (n-gram overlap, for generated text). Semantic similarity (cosine of embeddings). Task-specific classifiers.
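
Three of these metrics are small enough to sketch directly. The version below assumes whitespace tokenization and case-insensitive comparison, which are simplifications, and it takes embeddings as plain lists of floats from whatever embedding model is in use.

```python
import math
from collections import Counter
from typing import Sequence


def exact_match(pred: str, gold: str) -> bool:
    """Case-insensitive, whitespace-trimmed string equality."""
    return pred.strip().lower() == gold.strip().lower()


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

For example, `token_f1("the cat sat", "the cat")` has precision 2/3 and recall 1, giving F1 = 0.8.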

LLM as judge

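Use an LLM to score outputs. Cheap and fast compared with human review. Bias caveats: the judge model's own preferences, position bias (favoring an answer because of its slot in the prompt), verbosity bias (favoring longer answers). Calibrate against human labels.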
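
One common way to control for position bias in pairwise judging is to ask twice with the order swapped and keep only consistent verdicts. The sketch below assumes a hypothetical `judge` callable that takes `(prompt, answer_a, answer_b)` and returns `"A"` or `"B"`. How that callable reaches a model is left out on purpose.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, prompt: str, out_1: str, out_2: str) -> str:
    """Query the judge in both orders; accept a winner only if both runs agree."""
    first = judge(prompt, out_1, out_2)   # out_1 shown in slot A
    second = judge(prompt, out_2, out_1)  # out_1 shown in slot B
    if first == "A" and second == "B":
        return "out_1"
    if first == "B" and second == "A":
        return "out_2"
    return "inconsistent"  # verdict flipped with position: treat as a tie
```

A judge that always answers `"A"` yields `"inconsistent"` on every pair, so the rate of inconsistent verdicts is itself a rough measure of position bias. The swap does not remove verbosity bias: a judge that always prefers the longer answer stays perfectly consistent.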

Statistical significance

Report N. Compare systems on the same samples with McNemar's test or a paired bootstrap. A 5% metric change on a 30-sample set is noise: it amounts to one or two samples.
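
Both tests fit in a few lines of standard-library Python. This is a sketch under stated assumptions: the McNemar variant is the exact (binomial) two-sided form, the bootstrap uses a simple percentile interval, and the function names and defaults (`n_resamples`, `alpha`, `seed`) are illustrative choices.

```python
import math
import random
from typing import Sequence, Tuple


def mcnemar_exact(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value from per-sample correctness of A and B."""
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)  # only A right
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)  # only B right
    n = b + c  # discordant pairs; concordant pairs carry no information
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval for mean(B - A), resampling paired diffs."""
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# 30 samples: B fixes 3 of A's errors and breaks 1 case A had right.
a = [True] * 20 + [False] * 10
b = [True] * 19 + [False] + [True] * 3 + [False] * 7
p_value = mcnemar_exact(a, b)  # 4 discordant pairs, split 1 vs 3
ci = paired_bootstrap_ci([float(x) for x in a], [float(x) for x in b])
```

Here B scores 22/30 against A's 20/30, a gain of about 6.7 points, yet the exact McNemar p-value is 0.625 and the bootstrap interval spans zero, which is the sense in which a change of that size on 30 samples is noise.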