Split
10% → new prompt, 90% → old. Increase gradually. Log which version served each request.
Advertisement
Metrics
Task success rate. Latency. Cost. User feedback (thumbs up). Downstream conversion. Task-appropriate.
Advertisement
Sample size
Power analysis: for 5% detection at 80% power, need N depending on baseline rate. Small effects require thousands.
Guardrail metrics
Track hallucination rate, refusal rate, cost per query. Reject change if any regresses.