Evaluating Small Models: Common Pitfalls

Benchmarks designed for frontier models often underestimate small models on real-world tasks. Picking a small model based on MMLU score alone misses both where small models shine and where they fail. Better evaluation methodology leads directly to better product decisions.


MMLU isn't your workload

MMLU tests broad academic knowledge. If your workload is summarizing emails or extracting entities, MMLU score barely correlates with task performance. Build a task-specific eval set; that's the only score that matters.
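As a rough illustration of what a task-specific eval set can look like, here is a minimal sketch for the entity-extraction case. Everything in it is an assumption made for the example: the `EvalCase` shape, the choice of set-based F1 as the metric, and `toy_model`, which is a hypothetical stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: set[str]  # entities a correct answer should contain


def entity_f1(predicted: set[str], expected: set[str]) -> float:
    """F1 between the predicted and expected entity sets."""
    if not predicted and not expected:
        return 1.0
    true_pos = len(predicted & expected)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(expected)
    return 2 * precision * recall / (precision + recall)


def run_eval(model: Callable[[str], set[str]], cases: list[EvalCase]) -> float:
    """Mean F1 of `model` over the eval set."""
    scores = [entity_f1(model(c.prompt), c.expected) for c in cases]
    return sum(scores) / len(scores)


# Hypothetical stand-in for a real model: treats capitalized words as entities.
def toy_model(prompt: str) -> set[str]:
    return {w.strip(".,") for w in prompt.split() if w[:1].isupper()}


# Illustrative cases only; a real set would be drawn from your own traffic.
CASES = [
    EvalCase("Alice emailed Bob about the invoice.", {"Alice", "Bob"}),
    EvalCase("Meeting with Dave moved to Friday.", {"Dave"}),
]

mean_f1 = run_eval(toy_model, CASES)
print(f"mean F1: {mean_f1:.2f}")
```

The second case is there on purpose: the toy model over-extracts ("Meeting", "Friday"), which a broad knowledge benchmark would never reveal but a task-specific set catches immediately.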

Length bias

Many evals penalize long-form responses. A small model's verbose-but-correct answer can score worse than a terse-and-incorrect one. Check how your eval scores responses; many teams are moving to LLM-as-judge with task-appropriate rubrics.
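One way to check an existing metric for length bias is to hand-label a small sample and look for inversions, meaning cases where a longer correct answer scored below a shorter incorrect one. The sketch below assumes you already have, for each response, a token count, a human correctness label, and the automatic score; the `Scored` record and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    n_tokens: int
    correct: bool  # human-verified label
    score: float   # what the automatic metric assigned


def length_inversions(rows: list[Scored]) -> list[tuple[Scored, Scored]]:
    """Pairs where a longer correct answer scored below a shorter incorrect one."""
    correct = [r for r in rows if r.correct]
    wrong = [r for r in rows if not r.correct]
    return [
        (c, w)
        for c in correct
        for w in wrong
        if c.n_tokens > w.n_tokens and c.score < w.score
    ]


# Made-up sample: one verbose correct answer is outscored by a terse wrong one.
rows = [
    Scored(n_tokens=120, correct=True, score=0.4),
    Scored(n_tokens=15, correct=False, score=0.7),
    Scored(n_tokens=20, correct=True, score=0.9),
    Scored(n_tokens=200, correct=False, score=0.1),
]

inversions = length_inversions(rows)
print(f"{len(inversions)} length inversion(s) found")
```

A handful of inversions in a small sample is a hint, not proof, but it is usually enough to justify reading the scoring code before trusting the leaderboard.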


Quantization isn't free even when benchmarks say it is

A Q4 model scores within 1% of Q8 on benchmarks. Then your specific edge cases (long context, code, multi-step reasoning) show 10% degradation. Always re-evaluate on your own task after quantization; don't rely on the benchmark scores published with the quantization recipe.
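A simple way to make this check routine is to score the eval set per slice before and after quantization and flag any slice whose relative drop exceeds a threshold. The slice names, scores, and the 5% threshold below are placeholders chosen for the example, not measurements.

```python
def degradation_by_slice(
    baseline: dict[str, float],
    quantized: dict[str, float],
    threshold: float = 0.05,
) -> dict[str, float]:
    """Return slices whose relative score drop exceeds `threshold`."""
    flagged = {}
    for name, base in baseline.items():
        if base <= 0:
            continue  # relative drop is undefined for a zero baseline
        drop = (base - quantized[name]) / base
        if drop > threshold:
            flagged[name] = drop
    return flagged


# Placeholder per-slice scores on a task-specific eval (not real results).
q8_scores = {"short_qa": 0.80, "long_context": 0.70, "code": 0.60}
q4_scores = {"short_qa": 0.79, "long_context": 0.63, "code": 0.54}

flagged = degradation_by_slice(q8_scores, q4_scores)
for name, drop in flagged.items():
    print(f"{name}: {drop:.0%} relative drop after quantization")
```

Averaging these three slices would hide most of the damage, which is the point: the aggregate looks fine while the slices you care about do not.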

Prompt engineering hides capability gaps

With a big model, 'do task X' works. With a small model, 'do task X' fails, but 'do task X step by step using this template' succeeds. Evaluating both with the same prompt is unfair to the small model. Allow per-model prompts in production.
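Per-model prompting can be as simple as a lookup table keyed by model name, with a shared default. The sketch below is one possible shape; the model keys (`small-7b`, `big-70b`) and the template wording are hypothetical.

```python
# Hypothetical prompt templates keyed by model name.
PROMPTS = {
    "default": "Summarize the email below.\n\n{email}",
    "small-7b": (
        "Summarize the email below step by step.\n"
        "1. List the sender's requests.\n"
        "2. List any deadlines.\n"
        "3. Write a two-sentence summary.\n\n{email}"
    ),
}


def build_prompt(model_name: str, email: str) -> str:
    """Pick the model-specific template, falling back to the default."""
    template = PROMPTS.get(model_name, PROMPTS["default"])
    return template.format(email=email)


big_prompt = build_prompt("big-70b", "Can you send the report by Friday?")
small_prompt = build_prompt("small-7b", "Can you send the report by Friday?")
print(small_prompt)
```

Using the same table in both the eval harness and production keeps the comparison honest: each model is scored with the prompt it would actually be served with.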

Cost-adjusted scoring

70B vs 7B is roughly a 10x cost difference. If 70B is only 5% better on your eval, 7B might be the right choice. Plot cost vs quality, not raw quality. Many teams pick the wrong model because they look at only one axis.
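The cost-vs-quality plot boils down to finding the candidates that are not dominated, that is, no other candidate is both cheaper and at least as good. Here is a minimal sketch of that filter; the `Candidate` fields and all numbers are illustrative assumptions that mirror the 10x cost and 5% quality gap above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float     # hypothetical relative cost per request
    quality: float  # score on your task-specific eval


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep candidates that no other candidate beats on both cost and quality."""
    front = []
    for c in cands:
        dominated = any(
            o.cost <= c.cost
            and o.quality >= c.quality
            and (o.cost < c.cost or o.quality > c.quality)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.cost)


# Illustrative numbers only: 70b costs 10x the 7b and scores 5% higher.
candidates = [
    Candidate("7b", cost=1.0, quality=0.80),
    Candidate("13b", cost=2.5, quality=0.79),
    Candidate("70b", cost=10.0, quality=0.84),
]

front = pareto_front(candidates)
for c in front:
    print(f"{c.name}: cost={c.cost}, quality={c.quality}")
```

Anything off the front can be dropped outright. Choosing between the points that remain is a product decision about how much the extra quality is worth, which is exactly the question a single-axis comparison never asks.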

Eval on YOUR task, allow per-model prompting, re-eval post-quantization, plot cost-vs-quality. MMLU is for blog posts.
