Golden set
50-500 (input, expected_output) pairs curated per task. Cover normal + edge cases. Update as new failures are found.
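A minimal sketch of a golden set and a runner, assuming string inputs and outputs. `GoldenCase`, `run_golden_set`, `toy_model`, and the `tag` field are hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GoldenCase:
    # Hypothetical record type: one curated (input, expected_output) pair.
    input: str
    expected_output: str
    tag: str = "normal"  # e.g. "normal", "edge", "regression"


def run_golden_set(
    cases: List[GoldenCase], model_fn: Callable[[str], str]
) -> Tuple[float, List[Tuple[GoldenCase, str]]]:
    """Return (accuracy, failures), comparing outputs after stripping whitespace."""
    failures = []
    for case in cases:
        got = model_fn(case.input)
        if got.strip() != case.expected_output.strip():
            failures.append((case, got))
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures


# Toy stand-in for the system under test (assumption: a plain str -> str callable).
def toy_model(text: str) -> str:
    return "4" if text == "2+2" else "?"


cases = [
    GoldenCase("2+2", "4"),
    GoldenCase("", "ERROR: empty input", tag="edge"),
]
accuracy, failures = run_golden_set(cases, toy_model)
```

Failures returned here are candidates for new golden cases once triaged.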
Metrics
Exact match. F1. BLEU/ROUGE (text). Semantic similarity (embedding cosine). Task-specific classifiers.
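A rough sketch of three of these metrics in plain Python. The normalization (lowercasing, whitespace tokenization) is an assumption; real tasks often need task-specific normalization. `cosine_similarity` takes embedding vectors from whatever embedding model is in use, which is not shown here.

```python
import math
from collections import Counter
from typing import Sequence


def exact_match(pred: str, ref: str) -> bool:
    # Assumed normalization: strip surrounding whitespace and lowercase.
    return pred.strip().lower() == ref.strip().lower()


def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 using whitespace tokens (multiset overlap)."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        # Both empty counts as a match; one empty counts as a miss.
        return float(p == r)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Exact match suits short, canonical answers; token F1 gives partial credit; cosine similarity tolerates paraphrase but depends on the embedding quality.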
LLM as judge
Use an LLM to score outputs. Cheap + fast. Bias caveats: judge model preferences, position bias, verbosity bias. Calibrate against human labels.
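One common way to reduce position bias is to ask the judge twice with the answers swapped and keep only consistent verdicts. The sketch below assumes a hypothetical judge callable that returns "first", "second", or "tie"; no real LLM API is called. `agreement_rate` is a simple calibration check against human labels.

```python
from typing import Callable, Sequence

# Hypothetical judge signature (an assumption, not a real API):
# (prompt, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Query the judge in both orders; return "a" or "b" only if both passes agree."""
    pass_1 = judge(prompt, answer_a, answer_b)
    pass_2 = judge(prompt, answer_b, answer_a)
    # Map positional verdicts back to the underlying answers.
    verdict_1 = {"first": "a", "second": "b"}.get(pass_1, "tie")
    verdict_2 = {"first": "b", "second": "a"}.get(pass_2, "tie")
    return verdict_1 if verdict_1 == verdict_2 else "tie"


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge matches the human label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Stub judges standing in for a real model call.
def always_first(prompt: str, x: str, y: str) -> str:
    return "first"  # pure position bias


def prefers_longer(prompt: str, x: str, y: str) -> str:
    return "first" if len(x) > len(y) else "second"  # pure verbosity bias


position_result = debiased_preference(always_first, "q", "x", "y")
verbosity_result = debiased_preference(prefers_longer, "q", "long answer", "short")
```

The swap neutralizes the position-biased stub (result "tie") but not the verbosity-biased one (result "a"), which is why calibration against human labels is still needed.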
Statistical significance
Report N. Use McNemar's test (paired pass/fail results) or a paired bootstrap. A 5% metric change on a 30-sample set is 1-2 examples: noise.
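A sketch of both tests using only the standard library, assuming two systems scored on the same samples. The exact (binomial) form of McNemar's test is used here since discordant counts are small; the bootstrap uses a simple percentile interval. The 30-sample data below is made up for illustration.

```python
import math
import random
from typing import Sequence, Tuple


def mcnemar_exact(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value from paired pass/fail results."""
    only_a = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    only_b = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = only_a + only_b  # discordant pairs; concordant pairs carry no information
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2**n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(
    a_scores: Sequence[float],
    b_scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for mean(b - a), resampling per-sample differences."""
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(a_scores, b_scores)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Illustrative data: 30 samples, A gets 20 right, B gets 22 right (+6.7 points).
a_correct = [True] * 20 + [False] * 10
b_correct = list(a_correct)
b_correct[0] = False            # one sample A passes and B fails
for i in (20, 21, 22):
    b_correct[i] = True         # three samples B passes and A fails

p_value = mcnemar_exact(a_correct, b_correct)
ci_low, ci_high = paired_bootstrap_ci(
    [float(x) for x in a_correct], [float(y) for y in b_correct]
)
```

With 1 vs 3 discordant pairs the exact p-value is 0.625, and the bootstrap interval for the difference spans zero, so the apparent gain is not distinguishable from noise at this N.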