An SLI is a measurement of X, an SLO is an internal target ("we will keep X above a threshold"), and an SLA is a contract ("we pay if X breaks"). Each builds on the one before. Most teams confuse them; getting the hierarchy right is what turns 'we want 99.9% uptime' from a wish into an actionable system.

SLI: the measurement

A Service Level Indicator (SLI) is a specific numeric measurement of service behavior. Examples: the fraction of requests with status < 500, the p95 latency of POST /checkout, or the percentage of pages whose first byte arrives in under 200 ms. Picking the SLI is the hard part: it must match user pain.
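As a rough illustration, the first two example SLIs can be computed from a list of request records. This is a minimal sketch: the Request type, its field names, and the nearest-rank percentile method are assumptions made for the example, not something a particular monitoring system prescribes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical record shape; real logs or metrics will differ.
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not end in a server error (status < 500)."""
    if not requests:
        raise ValueError("no requests in the window; the SLI is undefined")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of latency, e.g. pct=95 for p95."""
    if not requests:
        raise ValueError("no requests in the window; the SLI is undefined")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(503, 30.0),   # server error: counts against availability
    Request(404, 80.0),   # client error: still "good" under status < 500
]
print(availability_sli(sample))        # 0.75
print(latency_percentile(sample, 95))  # 120.0
```

Note that the 404 counts as a good request under this definition. Whether that matches user pain depends on the service, which is the judgment call the SLI choice encodes.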

SLO: the target

A Service Level Objective (SLO) is a target threshold for an SLI, for example: 99.9% of requests have status < 500 over 30 days. It has two parameters: the target value (99.9%) and the window (30 days). The SLO is tighter than the SLA because it is an internal goal, not a commercial promise.
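One way to picture the two parameters is as a small record holding a target and a window, plus a compliance check. The class name, field names, and the rule that an empty window counts as compliant are all assumptions of this sketch.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SLO:
    # Hypothetical structure for illustration only.
    name: str
    target: float      # e.g. 0.999 for 99.9%
    window: timedelta  # e.g. 30 days

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True if the SLI over the window is at or above the target."""
        if total_events == 0:
            return True  # assumption: no traffic counts as compliant
        return good_events / total_events >= self.target


checkout_slo = SLO("checkout-availability", target=0.999, window=timedelta(days=30))

# 500 failures in a million requests is 99.95%: inside the objective.
print(checkout_slo.is_met(999_500, 1_000_000))  # True
# 2,000 failures in a million requests is 99.8%: objective missed.
print(checkout_slo.is_met(998_000, 1_000_000))  # False
```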

SLA: the commercial promise

A Service Level Agreement (SLA) is a customer-facing contract with refund or credit penalties, for example: if uptime drops below 99.5% in a month, the customer gets a 10% credit. The SLA should always be looser than your SLO; the gap is your safety margin.
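The example clause and the safety margin can be sketched in a few lines. The function names are hypothetical, the single-tier credit mirrors only the example above (real contracts often have several tiers), and the 30-day month is an assumption.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes; assumes a 30-day month


def sla_credit_fraction(monthly_uptime: float,
                        sla_threshold: float = 0.995,
                        credit: float = 0.10) -> float:
    """Credit owed as a fraction of the bill, per the single-tier example clause."""
    return credit if monthly_uptime < sla_threshold else 0.0


def safety_margin_minutes(slo_target: float, sla_threshold: float,
                          window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Extra downtime the SLA tolerates beyond what the SLO allows."""
    if sla_threshold >= slo_target:
        raise ValueError("the SLA should be looser than the SLO")
    return (slo_target - sla_threshold) * window_minutes


print(sla_credit_fraction(0.9990))  # 0.0 -> SLA held, no credit
print(sla_credit_fraction(0.9940))  # 0.1 -> SLA breached, 10% credit

# With a 99.9% SLO and a 99.5% SLA, the margin is roughly 172.8 minutes per
# 30 days: the time between missing the internal goal and owing money.
print(round(safety_margin_minutes(0.999, 0.995), 1))
```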

Error budget

If the SLO is 99.9%, you have a 0.1% error budget per window. Over a 30-day window, that is about 43 minutes of allowed downtime. The budget is spent on change: bug-fix releases, deploys, and infrastructure changes all draw from it, as does every other risky action. When the budget is exhausted, freeze risky deploys until the next window.
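The arithmetic behind the 43-minute figure, and the freeze rule, can be written out as a short sketch. The function names are hypothetical, and the budget is measured here in downtime; a request-based budget would count failed requests instead.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime in the window: (1 - target) * window."""
    return window * (1 - slo_target)


def budget_remaining(slo_target: float, window: timedelta,
                     downtime_so_far: timedelta) -> timedelta:
    """Budget left in the window; negative once the SLO has been blown."""
    return error_budget(slo_target, window) - downtime_so_far


def should_freeze_deploys(slo_target: float, window: timedelta,
                          downtime_so_far: timedelta) -> bool:
    """Freeze risky changes once the budget is fully spent."""
    return budget_remaining(slo_target, window, downtime_so_far) <= timedelta(0)


window = timedelta(days=30)
budget = error_budget(0.999, window)
# 0.1% of 30 days is about 43.2 minutes.
print(round(budget.total_seconds() / 60, 1))

print(should_freeze_deploys(0.999, window, timedelta(minutes=10)))  # False
print(should_freeze_deploys(0.999, window, timedelta(minutes=50)))  # True
```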

Per-team SLO ownership

Each service team owns its SLOs and error budget; the platform team aggregates them. The product team negotiates SLO targets against product requirements ('payment must be 99.99%; help center can be 99.5%'). This forces the conversation about the cost of reliability upfront.

SLI measures, SLO targets, SLA promises. Error budget converts reliability into a finite resource you can spend.