'Quantized model matches FP16 within 1% perplexity' is the standard claim. Perplexity is a smooth average; it hides task-specific regressions that matter in production. Real quantization evaluation is task-specific, statistical, and worth the time.
Perplexity is a starting point
Perplexity is the exponential of the average negative per-token log-likelihood. It is smooth, cheap to compute, and comparable across models that share a tokenizer and evaluation set. But two models can have similar perplexity and behave very differently on long-context reasoning, code, and math. Perplexity is necessary, not sufficient.
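A minimal sketch of the computation, assuming per-token natural-log probabilities have already been extracted from each model on the same text. The function names are illustrative, not from any particular library.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def relative_ppl_gap(fp16_logprobs, quant_logprobs):
    """Relative perplexity increase of the quantized model over FP16.

    Both lists must score the same tokens, so the two models need the same
    tokenizer and the same evaluation text.
    """
    base = perplexity(fp16_logprobs)
    return (perplexity(quant_logprobs) - base) / base
```

A gap of 0.01 from `relative_ppl_gap` corresponds to the "within 1%" claim. It is one number over the whole evaluation set, which is why it cannot show where the quality went.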
Task-specific benchmarks
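Code: HumanEval, MBPP. Math: GSM8K, MATH. Reasoning: MMLU, BBH. Long-context: RULER, LongBench. Run a representative subset from each category, with identical prompts and decoding settings for both models. Quantization costs more on some tasks than others (long-context and reasoning typically suffer most), so report per-task deltas rather than a single average.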
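One way to keep per-task regressions visible is to tabulate the deltas task by task instead of averaging them. The sketch below assumes each benchmark has already been scored and reduced to one number per model, higher being better. The 2% relative-drop threshold is an arbitrary placeholder, not a recommendation.

```python
def per_task_deltas(fp16_scores, quant_scores):
    """Return {task: (absolute_delta, relative_delta)} for tasks scored on both models."""
    deltas = {}
    for task in sorted(fp16_scores.keys() & quant_scores.keys()):
        base = fp16_scores[task]
        diff = quant_scores[task] - base
        # Relative delta is undefined when the baseline score is zero.
        rel = diff / base if base else float("nan")
        deltas[task] = (diff, rel)
    return deltas


def flag_regressions(deltas, max_relative_drop=0.02):
    """List tasks whose relative score drop exceeds the (hypothetical) threshold."""
    return [task for task, (_, rel) in deltas.items() if rel < -max_relative_drop]
```

A flagged task is a candidate for a closer look, not yet a confirmed regression; the drop still needs an interval around it.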
Production traffic replay
The most reliable approach for production: capture about 1,000 production prompts (scrubbed of PII), run them through both the FP16 and the quantized model, and compare the outputs. Track difference rate, length difference, and sentiment difference. This catches domain-specific quality drops that benchmarks miss.
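A rough sketch of the comparison step, assuming the two lists of outputs are already aligned by prompt and were generated with deterministic (greedy) decoding, so that differences come from the model rather than from sampling. `ReplayReport` and `compare_replay` are illustrative names. Sentiment difference is left out because it needs a separate classifier.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class ReplayReport:
    n: int
    difference_rate: float    # fraction of prompts whose outputs differ after whitespace normalization
    mean_similarity: float    # mean character-level difflib ratio, 1.0 means identical
    mean_length_ratio: float  # quantized length / FP16 length, in whitespace-separated words


def _normalize(text):
    """Collapse runs of whitespace so formatting noise is not counted as a difference."""
    return " ".join(text.split())


def compare_replay(fp16_outputs, quant_outputs):
    if not fp16_outputs or len(fp16_outputs) != len(quant_outputs):
        raise ValueError("need two non-empty, equally long lists of outputs")
    differs, sims, ratios = 0, [], []
    for a, b in zip(fp16_outputs, quant_outputs):
        a, b = _normalize(a), _normalize(b)
        differs += a != b
        sims.append(SequenceMatcher(None, a, b).ratio())
        # Guard against empty FP16 outputs when computing the length ratio.
        ratios.append(len(b.split()) / max(len(a.split()), 1))
    n = len(fp16_outputs)
    return ReplayReport(n, differs / n, sum(sims) / n, sum(ratios) / n)
```

Exact-match difference rate is coarse: two outputs can differ in wording and be equally good. It works best as a filter that picks which pairs get a human or model-graded review.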
Statistical significance
'Quantized model is 0.5% worse': over how many samples, and with what confidence interval? Most quantization comparisons report point estimates without intervals. Use N > 500 examples per metric and compute confidence intervals. Many 'regressions' are noise.
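One common way to get such an interval is a paired percentile bootstrap over per-example scores. This is a swapped-in standard technique, not something the text prescribes. The sketch assumes both models were scored on the same examples in the same order, for instance 1 for correct and 0 for incorrect.

```python
import random


def paired_bootstrap_ci(fp16_scores, quant_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(quant - fp16) over paired per-example scores.

    Returns (mean_difference, lower_bound, upper_bound).
    """
    if not fp16_scores or len(fp16_scores) != len(quant_scores):
        raise ValueError("need two non-empty, equally long score lists")
    diffs = [q - f for f, q in zip(fp16_scores, quant_scores)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample example-level differences with replacement.
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi
```

If the interval contains zero, the data do not separate the two models. For scale: an unpaired accuracy near 50% measured on 500 examples has a standard error of about 2.2 percentage points, so a 0.5% gap at that sample size is well inside the noise. Pairing narrows the interval, but it still needs to be computed.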
Deployment regression catching
Even after passing benchmarks, monitor production: user feedback rate, regeneration rate, downstream task success. Quantization can degrade specific user segments more than the aggregate numbers show. Set up dashboards before the quantization rollout, not after, so there is an FP16 baseline to compare against.
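A sketch of a per-segment check on regeneration rate, using a pooled two-proportion z-statistic. The segment names, the `(regenerations, requests)` input shape, and the threshold of 3.0 are assumptions for illustration. The threshold is set high on purpose because checking many segments at once inflates false alarms.

```python
import math


def two_proportion_z(events_a, total_a, events_b, total_b):
    """z-statistic for rate_b - rate_a using the pooled standard error."""
    p_pool = (events_a + events_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0  # both rates are exactly 0 or exactly 1
    return (events_b / total_b - events_a / total_a) / se


def segments_to_review(baseline, quantized, z_threshold=3.0):
    """baseline / quantized: {segment: (regenerations, requests)}.

    Returns segments whose regeneration rate rose by more than z_threshold
    standard errors after the rollout.
    """
    flagged = []
    for seg in sorted(baseline.keys() & quantized.keys()):
        events_a, total_a = baseline[seg]
        events_b, total_b = quantized[seg]
        if total_a == 0 or total_b == 0:
            continue  # no traffic in one period, nothing to compare
        if two_proportion_z(events_a, total_a, events_b, total_b) > z_threshold:
            flagged.append(seg)
    return flagged
```

The same check applies to any rate-style metric, such as negative feedback rate or downstream task failure rate, as long as the baseline counts were collected before the rollout.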