Public benchmarks (MMLU, MT-Bench, HumanEval) are useful for comparing vendors and almost useless for picking a model for your task. Once a model is a plausible production candidate, your task-specific eval is the only metric that matters.

Public benchmark hygiene

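MMLU saturates above 85%; at that level, gaps between leading models are mostly noise. HumanEval is contaminated: its problems have leaked into training data, so scores can overstate real coding ability. MT-Bench scores correlate better with human chat preference. Shortlist using recent benchmarks and clean leaderboards (LMSYS, HELM).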

Task-specific eval design

Build 100-500 examples that are representative of production traffic. Mix easy, hard, and adversarial cases. Grade with an LLM-as-judge that has been calibrated against human labels. The rubric should be clear enough that two independent graders agree.
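One way to check judge calibration and grader agreement is a chance-corrected agreement score. The sketch below computes Cohen's kappa between human labels and judge labels on the same examples. The label values, the sample data, and the function name `cohens_kappa` are illustrative assumptions, not part of any particular eval framework.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two graders on the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of examples where both graders gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each grader's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for eight eval examples.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(human, judge)
print(f"judge vs human kappa: {kappa:.2f}")
```

The same function works for two human graders when testing the rubric. What counts as an acceptable kappa is a judgment call for your task; the point is to measure agreement rather than assume it.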

Continuous eval in production

Sample 1% of production traffic. Grade it asynchronously (human, model, or both). Track regressions weekly. Re-run the eval whenever the model or prompt changes.
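A minimal sketch of the sampling and regression steps, under stated assumptions: requests carry a string ID, graded scores are floats in [0, 1], and a drop of more than 0.02 in the weekly mean counts as a regression. The names `should_sample` and `regressed`, the ID format, and the tolerance are hypothetical choices for illustration.

```python
import hashlib

SAMPLE_RATE = 0.01  # 1% of production traffic

def should_sample(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically pick roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)

def regressed(baseline_scores, current_scores, tolerance=0.02):
    """True if the current mean score fell below baseline by more than tolerance."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    current = sum(current_scores) / len(current_scores)
    return (baseline - current) > tolerance

# Count how many of 100,000 hypothetical request IDs would be sampled.
sampled = sum(should_sample(f"req-{i}") for i in range(100_000))
print(f"sampled {sampled} of 100000 requests")

# Compare last week's graded scores against this week's.
print(regressed([0.90, 0.88, 0.92], [0.84, 0.85, 0.86]))
```

Hashing the request ID instead of calling a random number generator keeps the decision reproducible, so the same request is always in or out of the sample regardless of which server handles it.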

Public benchmarks for shortlist, task-specific eval for decision, continuous eval for drift.