Benchmarks designed for frontier models often underestimate small models on real-world tasks. Picking a small model on MMLU score alone misses both where small models shine and where they fail. Better evaluation methodology leads directly to better product decisions.
MMLU isn't your workload
MMLU tests broad academic knowledge. If your workload is summarizing emails or extracting entities, an MMLU score correlates only weakly with task performance. Build a task-specific eval set from examples of your own workload; its score is the one that should drive the decision.
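A task-specific eval set can start very small. The sketch below assumes an entity-extraction workload; the three examples, the `GAZETTEER` lookup, and `toy_extractor` (a stand-in for a real model call) are all hypothetical and exist only to show the shape of the harness.

```python
# Hypothetical eval set: (input text, expected set of entities).
EVAL_SET = [
    ("Meeting with Alice in Paris on Friday.", {"Alice", "Paris"}),
    ("Bob flew to Tokyo.", {"Bob", "Tokyo"}),
    ("No names here.", set()),
]

# Assumption: a fixed lookup table stands in for the model under test.
GAZETTEER = {"Alice", "Paris", "Tokyo"}


def toy_extractor(text):
    """Placeholder for a model call: returns the known names found in the text."""
    words = text.replace(".", "").split()
    return {w for w in words if w in GAZETTEER}


def evaluate(extract, eval_set):
    """Fraction of examples where the extracted set equals the expected set."""
    correct = sum(1 for text, expected in eval_set if extract(text) == expected)
    return correct / len(eval_set)


score = evaluate(toy_extractor, EVAL_SET)
print(f"task accuracy: {score:.2f}")
```

Swapping `toy_extractor` for a function that calls each candidate model gives one comparable number per model on the workload that matters.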
Length bias
Many evals penalize long-form responses. A small model that produces verbose-but-correct answers can score worse than one that produces terse-and-incorrect ones. Check how your eval scores responses; many teams are moving to LLM-as-judge scoring with task-appropriate rubrics.
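To see how a scorer can punish length, here is a minimal sketch comparing three simple metrics on one made-up question. The metrics (`exact_match`, `token_f1`, `contains_reference`) are generic illustrations, not the scoring of any particular benchmark, and the answers are invented.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens with punctuation removed."""
    return re.findall(r"\w+", text.lower())


def exact_match(answer, reference):
    return tokens(answer) == tokens(reference)


def token_f1(answer, reference):
    """Token-overlap F1: extra words in the answer lower precision."""
    ans, ref = Counter(tokens(answer)), Counter(tokens(reference))
    overlap = sum((ans & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ans.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def contains_reference(answer, reference):
    """Lenient check: every reference token appears in the answer."""
    return not (Counter(tokens(reference)) - Counter(tokens(answer)))


reference = "Paris"
verbose_correct = "The capital of France is Paris, which has been the capital for centuries."
terse_incorrect = "Lyon"

for name, answer in [("verbose", verbose_correct), ("terse", terse_incorrect)]:
    print(name, exact_match(answer, reference),
          round(token_f1(answer, reference), 3),
          contains_reference(answer, reference))
```

Exact match gives the verbose correct answer the same zero as the wrong one, and token F1 gives it only about 0.14. The lenient containment check separates them, which is the kind of distinction a rubric-based judge is meant to capture more robustly.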
Quantization isn't free even when benchmarks say it is
A Q4 model may score within 1% of Q8 on published benchmarks. Then your specific edge cases (long context, code, multi-step reasoning) show a 10% degradation. Always re-run your own task eval after quantization instead of relying on the benchmark scores published with the quantization recipe.
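One way to catch this is to break eval results down by slice and compare before and after quantization. The sketch below uses invented results (not measurements); the slice names and the 5-point `threshold` are assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """results: iterable of (slice_name, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for slice_name, ok in results:
        totals[slice_name][0] += int(ok)
        totals[slice_name][1] += 1
    return {name: c / n for name, (c, n) in totals.items()}


def regressions(baseline, quantized, threshold=0.05):
    """Slices whose accuracy dropped by more than `threshold` after quantization."""
    base = accuracy_by_slice(baseline)
    quant = accuracy_by_slice(quantized)
    drops = {name: base[name] - quant[name] for name in base if name in quant}
    return {name: d for name, d in drops.items() if d > threshold}


# Illustrative data only: 90 short-QA examples and 10 long-context examples.
q8_results = ([("short_qa", True)] * 81 + [("short_qa", False)] * 9
              + [("long_context", True)] * 8 + [("long_context", False)] * 2)
q4_results = ([("short_qa", True)] * 81 + [("short_qa", False)] * 9
              + [("long_context", True)] * 6 + [("long_context", False)] * 4)

overall_q8 = sum(ok for _, ok in q8_results) / len(q8_results)
overall_q4 = sum(ok for _, ok in q4_results) / len(q4_results)
flagged = regressions(q8_results, q4_results)
print(overall_q8, overall_q4, flagged)
```

In this made-up data the aggregate moves by only two points while the long-context slice drops by twenty, which an aggregate-only comparison would hide.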
Prompt engineering hides capability gaps
Big model: 'do task X' works. Small model: 'do task X' fails; 'do task X step by step using this template' succeeds. Evaluating both with the same prompt is unfair to the small model. Allow per-model prompts, both in the eval and in production.
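Per-model prompts can be as simple as a lookup table keyed by model name. In this sketch the model names (`large-model`, `small-model`), the template wording, and the fallback behavior are all hypothetical.

```python
# Hypothetical templates: the small model gets explicit steps, the large one does not.
PROMPTS = {
    "large-model": "Summarize the email below.\n\n{email}",
    "small-model": (
        "Summarize the email below step by step.\n"
        "1. List the sender's requests.\n"
        "2. List any deadlines.\n"
        "3. Write a two-sentence summary.\n\n"
        "Email:\n{email}"
    ),
}

# Assumption: unknown models fall back to the plain prompt.
DEFAULT_PROMPT = PROMPTS["large-model"]


def build_prompt(model_name, email):
    """Pick the template tuned for this model and fill in the email text."""
    template = PROMPTS.get(model_name, DEFAULT_PROMPT)
    return template.format(email=email)


print(build_prompt("small-model", "Can you send the report by Friday?"))
```

Using the same table in the eval harness and in production keeps the measured score tied to the prompt that actually ships.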
Cost-adjusted scoring
A 70B model costs roughly 10x as much to run as a 7B model. If the 70B is only 5% better on your eval, the 7B might be the right choice. Plot cost against quality instead of looking at raw quality alone. Many teams pick the wrong model because they look at only one axis.
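The two-axis view can be made concrete with a small Pareto filter. This is a sketch under stated assumptions: the quality and cost numbers are invented to match the 10x and 5% figures above, and the `13B` entry is a hypothetical third candidate added to show a dominated option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float  # score on your task-specific eval, 0..1
    cost: float     # relative cost per request


def pareto_frontier(cands):
    """Keep candidates that no other candidate beats on both quality and cost."""
    frontier = []
    for c in cands:
        dominated = any(
            o.quality >= c.quality and o.cost <= c.cost
            and (o.quality > c.quality or o.cost < c.cost)
            for o in cands
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cost)


def cheapest_meeting_bar(cands, min_quality):
    """Cheapest candidate whose quality meets the product's bar, or None."""
    ok = [c for c in cands if c.quality >= min_quality]
    return min(ok, key=lambda c: c.cost) if ok else None


# Illustrative numbers only.
candidates = [
    Candidate("7B", quality=0.80, cost=1.0),
    Candidate("13B", quality=0.78, cost=2.0),   # worse and pricier than 7B
    Candidate("70B", quality=0.84, cost=10.0),  # 5% better than 7B at 10x the cost
]

frontier = pareto_frontier(candidates)
choice = cheapest_meeting_bar(candidates, min_quality=0.80)
print([c.name for c in frontier], choice.name if choice else None)
```

The frontier shows which models are worth considering at all; the quality bar, which is a product decision, then picks among them.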