SLIs, SLOs, and Error Budgets

SRE books make SLOs sound deterministic. In practice, picking the right SLI, setting an SLO that matters, and enforcing the error budget against feature velocity are mostly social and process problems with some math attached. The math part isn't hard; the rest takes a year.

SLI: pick a user-impact signal

Good SLIs: p99 latency of the critical user request, availability of the homepage, error rate on checkout. Bad SLIs: 'CPU utilization', 'queue depth', 'cache hit rate'. Those are causes, not user impact.
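
To make "user-impact signal" concrete, here is a minimal sketch of computing two such SLIs from raw request records. The `Request` record, its field names, and the choice to count only 5xx responses as failures are assumptions for illustration, not something a particular monitoring stack prescribes.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative."""
    latency_ms: float
    status: int  # HTTP status code


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of request latency, in milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]
```

Both functions take only what the user experienced (status and latency) as input. Nothing about CPU or queue depth appears, which is the point of the distinction above.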

SLO: pick a number you can defend

99.9% availability = about 43 min of downtime in a 30-day month. 99.95% = about 22 min. The number should match user expectation and competitive landscape, not be aspirational. Each extra nine cuts the allowed downtime tenfold, and the engineering cost of getting there rises steeply.
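
The downtime figures above are plain arithmetic, and it can help to see them side by side before committing to a number. A small sketch, assuming a 30-day window (the function name is made up for this example):

```python
def allowed_downtime_minutes(slo: float, window_days: float = 30) -> float:
    """Minutes of full downtime an availability SLO permits in the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


# Print the allowance for a few common targets.
for slo in (0.99, 0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo:.2%} -> {minutes:.1f} min per 30 days")
```

Going from 99.9% to 99.99% shrinks the allowance from roughly 43 minutes to roughly 4, which is less time than most teams need to notice and respond to a page.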

Error budget = (1 - SLO) × time

99.9% SLO = 0.1% × 30 days = 43 minutes/month error budget. Spend it on planned maintenance, risky deploys, novel features. Budget exhausted = freeze feature work and fix reliability. The mechanism only works if leadership respects the freeze.
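
A rough sketch of the bookkeeping behind that freeze decision, assuming a time-based budget and that downtime minutes are tallied elsewhere. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Time-based error budget for one SLO window (illustrative only)."""
    slo: float              # e.g. 0.999
    window_days: float = 30

    @property
    def total_minutes(self) -> float:
        # Error budget = (1 - SLO) x time
        return (1 - self.slo) * self.window_days * 24 * 60

    def remaining_minutes(self, downtime_minutes: float) -> float:
        """Budget left after the downtime already spent in this window."""
        return self.total_minutes - downtime_minutes

    def should_freeze(self, downtime_minutes: float) -> bool:
        """True once the budget is exhausted: stop features, fix reliability."""
        return self.remaining_minutes(downtime_minutes) <= 0
```

For a 99.9% SLO, `ErrorBudget(0.999).should_freeze(50)` is true because 50 minutes exceeds the roughly 43-minute budget. The code is the trivial part; whether anyone acts on a true result is the organizational question.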

Burn rate alerts

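Don't alert on 'SLO violated'; by then the budget is already gone. Alert on burn rate, meaning how fast the budget is being consumed relative to the rate that would use it up exactly at the end of the window: 'consuming 2 weeks of budget per hour'. Multi-window alerts (burn rate above the threshold over a long window such as 1 hour AND still above it over a short window such as 5 min) reduce noise: the long window shows the burn is significant, the short one shows it is still happening.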
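
One common way to implement multi-window alerting is to require the same burn-rate threshold to be exceeded over both a 1-hour and a 5-minute window. The sketch below assumes a request-based SLI and that error and request counts per window come from a metrics system not shown here. The default threshold of 14.4 is an example value: it corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 hours).

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it proportionally sooner.
    """
    if total == 0:
        return 0.0  # assumption: no traffic burns no budget
    return (errors / total) / (1 - slo)


def should_page(
    long_window: tuple[int, int],   # (errors, total) over e.g. 1 hour
    short_window: tuple[int, int],  # (errors, total) over e.g. 5 minutes
    slo: float,
    threshold: float = 14.4,        # example value, tune per service
) -> bool:
    """Page only if both windows burn faster than the threshold."""
    long_burn = burn_rate(*long_window, slo)
    short_burn = burn_rate(*short_window, slo)
    return long_burn >= threshold and short_burn >= threshold
```

The short window is what makes the alert reset quickly: once errors stop, the 5-minute burn rate drops below the threshold even though the 1-hour figure stays elevated for a while.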

Common failure modes

SLO that's always green (set too low). SLO that's always red (set too high or wrong metric). Error budget never enforced (cultural failure). SLO on internal metric users don't see ('cache hit rate'). All of these mean the SLO isn't useful.

User-impact SLI + defensible SLO + burn-rate alerts + actual enforcement. The math is easy; the social contract is the work.