A new service deserves observability from day 1. Most teams either skip it (and regret on day 2 of the first incident) or over-install (and waste setup time). Here's the staged approach that actually works.
Day 1 — telemetry baseline
Structured JSON logs to stdout. OpenTelemetry SDK for traces (auto-instrumentation if available). RED metrics: rate, errors, duration. Health check endpoint. One synthetic check pinging it.
Day 30 — alerting and dashboards
Burn-rate alert on availability/latency SLO. Dashboard with RED + service-specific KPIs. Runbook (linked from alert) for each alert. Test fire one alert to make sure pager works.
Day 90 — depth
Tail-sampled tracing (collector tier). Continuous profiling. Log-based metrics for specific error patterns. Internal user docs explaining what each dashboard means. Quarterly review of alert noise.