Split

10% → new prompt, 90% → old. Increase gradually. Log which version served each request.

Advertisement

Metrics

Task success rate. Latency. Cost. User feedback (thumbs up). Downstream conversion. Task-appropriate.

Advertisement

Sample size

Power analysis: for 5% detection at 80% power, need N depending on baseline rate. Small effects require thousands.

Guardrail metrics

Track hallucination rate, refusal rate, cost per query. Reject change if any regresses.