Spot-checking 10 agent traces tells you nothing about quality. Production-scale evaluation is its own discipline — trace replay, LLM-judge calibration, regression catching, drift monitoring. The mature stack has 5-10 layers.
Advertisement
Trace capture and replay
Capture every prompt, response, tool call. Replay against new model/prompt versions. Find regressions before deploy. Tools: Langfuse, Arize Phoenix, OpenLLMetry.
LLM-judge with calibration
Use strong model to grade traces against rubric. Calibrate against human grades on a sample (target >0.7 Pearson correlation). Re-calibrate when judge model changes.
Advertisement
Drift monitoring
Production quality scores tracked over time. Sudden drop = model provider changed something. Slow drop = input distribution shifted. Both need investigation; both are easy to miss without tracking.
Trace + replay + calibrated judge + drift tracking. Spot-check is a starting point, not a finish line.