Tom Wilkie's RED (Rate, Errors, Duration) and Brendan Gregg's USE (Utilization, Saturation, Errors) are two of the most widely cited 'what to monitor' frameworks. They answer different questions and are best used together.
RED — for services
Rate: requests per second. Errors: failed requests, as a rate or as a % of all requests. Duration: latency distribution (p50, p95, p99), not just the average. One service = three core metrics. If you can't see all three, you can't tell whether it's healthy.
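As a rough illustration of the three numbers above, the sketch below derives them from a list of request records. The `Request` record, its field names, and the nearest-rank percentile method are assumptions made for this example, not part of the RED method itself.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape; real access logs will differ.
    duration_s: float
    ok: bool


def percentile(sorted_values, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    rank = (p * len(sorted_values) + 99) // 100  # integer ceil(p * n / 100)
    return sorted_values[rank - 1]


def red_summary(requests, window_s):
    """Rate, Errors, Duration for the requests seen in one time window."""
    n = len(requests)
    if n == 0:
        return {"rate": 0.0, "error_pct": 0.0, "p50": None, "p95": None, "p99": None}
    durations = sorted(r.duration_s for r in requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "rate": n / window_s,             # requests per second
        "error_pct": 100.0 * failed / n,  # % of failed requests
        "p50": percentile(durations, 50),
        "p95": percentile(durations, 95),
        "p99": percentile(durations, 99),
    }
```

Calling `red_summary(requests, window_s=60)` once per minute per service would give one RED data point per window.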
USE — for resources
Utilization: % of the resource in use, or % of time it is busy (CPU, mem, disk). Saturation: work queued or waiting because the resource can't serve it yet (CPU run queue, I/O wait, disk queue depth). Errors: hardware errors, retries. Resource = one node, disk, network interface. USE catches bottlenecks before they cause service-level issues.
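A minimal sketch of a USE check for a single resource, assuming periodic samples are already being collected. The `Sample` fields and the 80% utilization limit are hypothetical choices for the example; sensible thresholds depend on the resource.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical sample shape for one resource (one CPU, one disk, one NIC).
    busy_pct: float  # utilization at sample time
    queued: int      # work items waiting (run queue length, I/O queue depth)
    errors: int      # error events since the previous sample


def use_summary(samples):
    """Utilization, Saturation, Errors over a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    return {
        "utilization_pct": sum(s.busy_pct for s in samples) / n,
        "saturation": sum(s.queued for s in samples) / n,  # mean queue length
        "errors": sum(s.errors for s in samples),
    }


def use_findings(summary, util_limit_pct=80.0):
    """List the USE dimensions that look unhealthy (assumed threshold)."""
    findings = []
    if summary["utilization_pct"] >= util_limit_pct:
        findings.append("high utilization")
    if summary["saturation"] > 0:
        findings.append("saturated")
    if summary["errors"] > 0:
        findings.append("errors")
    return findings
```

Averaging hides short bursts, so a real check would probably also track the maximum queue length per window.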
Why both
RED tells you 'the service is slow or failing'. USE tells you 'why': CPU saturated, disk full, network interface reporting errors. RED is user-facing, USE is operator-facing. SREs need both; application engineers usually only see RED.
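The division of labour can be sketched as a two-step triage: RED decides whether there is a symptom, USE narrows down which resource to look at. The dictionary keys, SLO values, and the 90% utilization limit below are all assumptions for illustration.

```python
def triage(red, use_by_resource, latency_slo_s=0.5, error_slo_pct=1.0,
           util_limit_pct=90.0):
    """Return service symptoms (from RED) and suspect resources (from USE)."""
    symptoms = []
    if red["p99_s"] > latency_slo_s:
        symptoms.append("slow")
    if red["error_pct"] > error_slo_pct:
        symptoms.append("failing")

    suspects = []
    if symptoms:
        # Only dig into resources once the service shows a symptom.
        for name, use in use_by_resource.items():
            if (use["utilization_pct"] >= util_limit_pct
                    or use["saturation"] > 0
                    or use["errors"] > 0):
                suspects.append(name)
    return {"symptoms": symptoms, "suspects": suspects}
```

An empty `suspects` list alongside a non-empty `symptoms` list suggests the cause is not a resource bottleneck on the nodes checked, for example a slow downstream dependency.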
Implementation
RED: instrument every endpoint with a request counter, an error counter, and a duration histogram; rate and error % are derived from the counters at query time (the usual Prometheus RED dashboard pattern). USE: collect node_exporter / cAdvisor / NVMe metrics. Wire both into the same Grafana: one row per service (RED), one row per node (USE).
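To show the shape of per-endpoint RED instrumentation, here is a standard-library-only sketch. It is not the Prometheus client API: `RedMetrics`, `instrument`, and the bucket bounds are hypothetical names and values chosen for this example, and a real service would more likely use a Prometheus client library's counter and histogram types.

```python
import time
from bisect import bisect_left
from collections import defaultdict
from functools import wraps

# Assumed upper bounds in seconds; the last bucket catches everything else.
BUCKETS_S = (0.005, 0.025, 0.1, 0.5, 2.5, float("inf"))


class RedMetrics:
    """In-memory request counter, error counter, and duration histogram."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        # Per-bucket counts (not cumulative), one list per endpoint.
        self.duration_buckets = defaultdict(lambda: [0] * len(BUCKETS_S))

    def observe(self, endpoint, duration_s, ok):
        self.requests[endpoint] += 1
        if not ok:
            self.errors[endpoint] += 1
        # First bucket whose upper bound is >= the observed duration.
        self.duration_buckets[endpoint][bisect_left(BUCKETS_S, duration_s)] += 1

    def instrument(self, endpoint):
        """Decorator that records one observation per call to the handler."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                ok = False
                try:
                    result = func(*args, **kwargs)
                    ok = True
                    return result
                finally:
                    # Runs on success and on exception; exceptions still propagate.
                    self.observe(endpoint, time.perf_counter() - start, ok)
            return wrapper
        return decorator
```

Decorating a handler with `@metrics.instrument("/checkout")` records every call, including ones that raise. Rate and error % then come from how the counters change between two scrapes.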
The four golden signals
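The Google SRE book defines the 'Four Golden Signals': Latency, Traffic, Errors, Saturation. RED maps onto the first three (Duration, Rate, Errors) and omits Saturation; RED is usually described as derived from the golden signals rather than the other way round. So RED + Saturation = 'Four Golden Signals', with Saturation bridging the gap to USE. Largely a notational difference from RED+USE; the concept is the same.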