Prometheus at Scale — Mimir, Cortex, Thanos

Single-node Prometheus is great until it isn't. Local retention defaults to 15 days, and one node's memory and disk cap how many series and how much history it can hold. Mimir, Cortex, and Thanos solve the scaling problem in three different shapes, and the choice among them shapes operational cost for years.

Single-node Prometheus limits

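Memory: as a rough rule of thumb, about 10M active series per 64GB node. Retention: in practice 2-4 weeks before disk usage grows and long-range queries get slow. HA: running two identical Prometheus servers duplicates the scrapes but does not replicate data between them, so each replica keeps its own gaps. Past these limits, you're scaling out.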
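The limits above reduce to back-of-envelope arithmetic. The sketch below scales memory linearly from the 10M-series-per-64GB ratio and estimates disk as retention seconds times samples per second times bytes per sample. The 2 bytes per sample is an assumption (the upper end of the 1-2 byte range the Prometheus storage docs give), and the function names are illustrative, not part of any tool.

```python
# Rough single-node capacity arithmetic. All defaults are assumptions:
# - series_per_64gb comes from the rule of thumb in the text above
# - bytes_per_sample=2.0 is an assumed upper bound for compressed samples

GIB = 1024 ** 3


def memory_gb(active_series: int, series_per_64gb: int = 10_000_000) -> float:
    """Estimate RAM by scaling linearly from the 10M-series-per-64GB ratio."""
    return 64 * active_series / series_per_64gb


def disk_gib(active_series: int, scrape_interval_s: float,
             retention_days: float, bytes_per_sample: float = 2.0) -> float:
    """Estimate disk as retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample / GIB


if __name__ == "__main__":
    series = 10_000_000
    print(f"memory: {memory_gb(series):.0f} GB")
    print(f"disk, 28d retention, 15s scrape: {disk_gib(series, 15, 28):.0f} GiB")
```

Under these assumptions, 10M series scraped every 15 seconds and kept for 28 days comes to roughly 3 TiB of local disk, which is where the 2-4 week practical ceiling starts to bite.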

Remote write — the shared on-ramp

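Prometheus pushes samples via remote_write to a long-term storage backend. All three systems can consume this protocol: Mimir and Cortex through their distributors, Thanos through its Receive component (its sidecar pattern uploads blocks instead of using remote write). The push side is solved; the storage side is where they differ.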
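To make the on-ramp concrete, here is a minimal sketch that renders a remote_write stanza for each backend. The helper name, hostnames, and tenant ID are hypothetical. The push paths and the X-Scope-OrgID tenant header are the commonly documented defaults, so verify them against the version you run.

```python
from typing import Optional

# Default push paths per backend (assumed defaults; check your deployment).
PUSH_PATHS = {
    "mimir": "/api/v1/push",      # Mimir distributor
    "cortex": "/api/v1/push",     # Cortex distributor
    "thanos": "/api/v1/receive",  # Thanos Receive
}


def remote_write_stanza(backend: str, base_url: str,
                        tenant: Optional[str] = None) -> str:
    """Render a prometheus.yml remote_write stanza as YAML text."""
    url = base_url.rstrip("/") + PUSH_PATHS[backend]
    lines = ["remote_write:", f"  - url: {url}"]
    if tenant is not None and backend in ("mimir", "cortex"):
        # Mimir and Cortex identify the tenant with the X-Scope-OrgID header.
        # Thanos Receive tenancy is left out of this sketch.
        lines += ["    headers:", f"      X-Scope-OrgID: {tenant}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder hostname and tenant, for illustration only.
    print(remote_write_stanza("mimir", "http://mimir.example.internal", "team-a"))
```

The Prometheus side is identical in every case; only the URL and the tenant header change, which is why switching backends later is mostly a storage-side project.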

Mimir vs Cortex vs Thanos

Mimir (Grafana Labs): forked from Cortex in 2022, multi-tenant by design, with blocks in object storage such as S3 and horizontally scalable ingesters. The most operationally polished option in 2026. Cortex: the project Mimir was forked from, with a similar architecture and less active development. Thanos: sidecar pattern, where a sidecar next to each Prometheus uploads its TSDB blocks to object storage directly. Simpler to bolt on; harder to scale to many tenants.

Query federation

All three can fan a single query out across many backends or shards. How much filtering (label matchers, time range) gets pushed down to the storage layer matters for performance. PromQL compatibility is high but not 100%, so test critical queries against the new backend before committing.
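One way to run that compatibility check is to send the same instant query to the old and new backends and diff the results. The sketch below uses only the standard Prometheus HTTP API (/api/v1/query) and handles instant-vector results only. The base URLs are placeholders, and some backends serve the API under a path prefix, so adjust base_url to match.

```python
import json
import math
import urllib.parse
import urllib.request


def instant_query(base_url: str, promql: str, ts: float) -> list:
    """Run an instant query and return the data.result list."""
    qs = urllib.parse.urlencode({"query": promql, "time": ts})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{qs}", timeout=30) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


def to_map(result: list) -> dict:
    """Index an instant-vector result by its sorted label set."""
    return {tuple(sorted(s["metric"].items())): float(s["value"][1]) for s in result}


def diff_results(old: list, new: list, rel_tol: float = 1e-6) -> list:
    """Return (kind, labels) pairs where the two results disagree."""
    a, b = to_map(old), to_map(new)
    problems = []
    for labels in sorted(a.keys() | b.keys()):
        if labels not in a:
            problems.append(("extra", labels))      # only in the new backend
        elif labels not in b:
            problems.append(("missing", labels))    # only in the old backend
        elif math.isnan(a[labels]) and math.isnan(b[labels]):
            continue                                # treat NaN == NaN as a match
        elif not math.isclose(a[labels], b[labels], rel_tol=rel_tol):
            problems.append(("value", labels))
    return problems


if __name__ == "__main__":
    import time
    now = time.time()
    q = 'sum by (job) (rate(http_requests_total[5m]))'  # example query
    # Placeholder URLs; point these at the real old and new query endpoints.
    old = instant_query("http://prometheus.example.internal:9090", q, now)
    new = instant_query("http://new-backend.example.internal", q, now)
    for kind, labels in diff_results(old, new):
        print(kind, dict(labels))
```

Pinning both queries to the same timestamp matters: without it, small differences in evaluation time show up as value mismatches that have nothing to do with PromQL compatibility.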

Choosing in 2026

New deployment, multi-tenant or large: Mimir. Existing Prometheus where you want long retention: Thanos (smallest lift). Existing Cortex deployments are largely migrating to Mimir.

In short: Mimir for greenfield at scale, Thanos as an incremental add-on, and Cortex is being supplanted. Remote write is the common on-ramp.