Observability from Scratch

A new service deserves observability from day 1. Most teams either skip it (and regret it on day 2 of the first incident) or over-install (and waste setup time). Here's a staged approach that avoids both.

Day 1 — telemetry baseline

Structured JSON logs to stdout. OpenTelemetry SDK for traces (auto-instrumentation if available). RED metrics: rate, errors, duration. Health check endpoint. One synthetic check pinging it.
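
As a rough illustration of the first and third items, here is a minimal standard-library sketch of JSON logs on stdout plus in-memory RED counters. The names JsonFormatter, build_logger, RedMetrics, the "fields" key, and the "checkout" logger name are all hypothetical, and treating status >= 500 as an error is an assumption. A real service would more likely export these metrics through a metrics library than keep them in a dict.

```python
import json
import logging
import sys
from collections import defaultdict


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": record.created,  # epoch seconds
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured fields passed as extra={"fields": {...}} (hypothetical convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger that writes JSON lines to stdout only."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


class RedMetrics:
    """In-memory RED counters per route: request count, errors, duration."""

    def __init__(self) -> None:
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.duration_sum = defaultdict(float)

    def observe(self, route: str, status: int, seconds: float) -> None:
        self.requests[route] += 1
        if status >= 500:  # assumption: only 5xx counts as an error
            self.errors[route] += 1
        self.duration_sum[route] += seconds

    def error_ratio(self, route: str) -> float:
        total = self.requests[route]
        return self.errors[route] / total if total else 0.0
```

A request handler would then call something like build_logger().info("request served", extra={"fields": {"route": "/pay", "status": 200}}) and metrics.observe("/pay", 200, 0.12) once per request.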

Day 30 — alerting and dashboards

Burn-rate alert on the availability/latency SLO. Dashboard with RED + service-specific KPIs. A runbook for each alert, linked from the alert itself. Test-fire one alert to make sure the pager works.
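
To make "burn rate" concrete: it is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO period. The sketch below assumes a 99.9% target and a two-window check with a threshold of 14.4, a commonly cited paging value for a 1-hour window against a 30-day SLO. The function names and defaults are illustrative, not taken from any particular alerting tool.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than budget the service is failing."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float = 0.999,   # assumed SLO
    threshold: float = 14.4,     # assumed paging threshold
) -> bool:
    """Page only if both the long and the short window burn too fast.

    The short window keeps the alert from firing long after the
    problem has already stopped.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

With these defaults, a 2% error ratio over the long window and 3% over the short window gives burn rates of about 20 and 30, so should_page returns True. If the short window has already recovered to 0.1%, it returns False.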

Day 90 — depth

Tail-sampled tracing (collector tier). Continuous profiling. Log-based metrics for specific error patterns. Internal user docs explaining what each dashboard means. Quarterly review of alert noise.
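
The idea behind tail sampling is that the keep/drop decision is made after a trace has finished, so errors and slow requests can always be retained while healthy traffic is thinned out. The following is a sketch of that decision logic only, not the configuration of any real collector. TraceSummary, keep_trace, the 1000 ms slow threshold, and the 5% baseline ratio are all assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """What a collector tier knows once a trace is complete (hypothetical shape)."""
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(
    trace: TraceSummary,
    slow_ms: float = 1000.0,       # assumed latency threshold
    baseline_ratio: float = 0.05,  # assumed share of healthy traces to keep
) -> bool:
    """Decide whether to keep a finished trace."""
    # Always keep the interesting traces.
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    # Hash the trace ID into [0, 1) so every collector instance
    # makes the same decision for the same trace.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_ratio
```

Hashing the trace ID instead of calling a random number generator keeps the decision deterministic, which matters when more than one collector instance might see the same trace.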

Day 1: telemetry. Day 30: alerts + dashboards + runbooks. Day 90: depth + review. Don't skip day 1.