A dead letter queue (DLQ) is where messages go when they can't be processed. Without one: bad messages block consumers, retry loops hammer downstream services, and operational visibility disappears. With one: failures are quarantined, observable, and replayable. The pattern applies to any message system, whether the broker supports it natively or the consumer implements it.
When to DLQ
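Permanently bad messages (malformed, schema mismatch, references to data that doesn't exist). Repeatedly failing messages (still failing after N retries). Messages that violate invariants (negative amount, missing required field). Don't DLQ transient errors such as timeouts or brief downstream outages on the first failure; those should retry, ideally with backoff, and only land in the DLQ once the retry budget is exhausted.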
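As a rough illustration of that routing decision, here is a minimal sketch. The exception classes, MAX_ATTEMPTS, and the in-memory list standing in for the DLQ are all assumptions made for the example, not part of any particular client library.

```python
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, brief outages)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


MAX_ATTEMPTS = 3  # assumed retry budget; the "N" in the text


def process_with_dlq(message: dict, handler: Callable[[dict], None], dlq: list) -> str:
    """Return "ok" if the handler succeeded, "dlq" if the message was quarantined."""
    last_error = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handler(message)
            return "ok"
        except TransientError as exc:
            # Retry. A real consumer would sleep with backoff here.
            last_error = str(exc)
        except PermanentError as exc:
            # Retrying cannot help, so quarantine immediately.
            dlq.append({"message": message, "error": str(exc), "attempts": attempt})
            return "dlq"
    # Retry budget exhausted: treat the message as repeatedly failing.
    dlq.append({"message": message, "error": last_error, "attempts": MAX_ATTEMPTS})
    return "dlq"


def charge(message: dict) -> None:
    """Example handler that enforces one invariant from the text."""
    if message.get("amount", 0) < 0:
        raise PermanentError("negative amount")
```

Exceptions that are neither transient nor permanent propagate to the caller in this sketch; whether unknown errors should retry or go straight to the DLQ is a policy choice the consumer has to make explicitly.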
Setup pattern
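One DLQ per source topic, named my-topic.dlq or similar. Write the failed message plus failure metadata (error, attempt count, original headers) to the DLQ. Once that write has succeeded, the original consumer commits past the failure and processing continues for other messages. Committing before the DLQ write succeeds risks losing the message.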
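One possible shape for the DLQ record, sketched with plain dictionaries. The dlq.* header names and the build_dlq_record helper are hypothetical conventions chosen for this example; the only thing taken from the text is the .dlq topic suffix and the list of metadata to carry.

```python
import time


def dlq_topic_for(topic: str) -> str:
    """Derive the per-topic DLQ name, following the my-topic.dlq convention."""
    return f"{topic}.dlq"


def build_dlq_record(topic: str, payload: bytes, headers: dict,
                     error: Exception, attempts: int) -> dict:
    """Wrap a failed message with failure metadata, leaving the payload untouched."""
    dlq_headers = dict(headers)  # copy, so the original headers survive as-is
    dlq_headers.update({
        # Hypothetical header names; pick one scheme and keep it consistent.
        "dlq.original.topic": topic,
        "dlq.error.class": type(error).__name__,
        "dlq.error.message": str(error),
        "dlq.attempts": str(attempts),
        "dlq.failed.at": str(int(time.time())),
    })
    return {
        "topic": dlq_topic_for(topic),
        "value": payload,  # original bytes, so a replay is byte-identical
        "headers": dlq_headers,
    }
```

Keeping the payload unchanged and putting the metadata in headers means a replay tool can republish the value without having to unwrap an envelope first.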
DLQ retention
Long enough for human investigation. Days to weeks is typical. Watch DLQ size: sudden growth means a new failure mode worth investigating. Don't let messages expire or get purged before someone has looked at them; you'll lose data needed for incident analysis.
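"Sudden growth" needs a concrete definition before it can drive an alert. The check below is one simple way to express it, assuming the DLQ depth is sampled at a fixed interval. The function name and both thresholds are illustrative assumptions and would need tuning against real traffic.

```python
from typing import Sequence


def is_sudden_growth(depths: Sequence[int], factor: float = 3.0, min_new: int = 10) -> bool:
    """Flag a jump in DLQ depth relative to the recent baseline.

    depths: DLQ message counts sampled at a fixed interval, oldest first.
    factor: how many times the average earlier increase counts as "sudden".
    min_new: ignore jumps smaller than this many messages.
    """
    if len(depths) < 3:
        return False  # not enough history to form a baseline
    # Per-interval increases; a shrinking DLQ (expiry, replay) counts as zero.
    deltas = [max(later - earlier, 0) for earlier, later in zip(depths, depths[1:])]
    latest, history = deltas[-1], deltas[:-1]
    baseline = sum(history) / len(history)
    return latest >= min_new and latest > factor * baseline
```

For example, a DLQ that gains about two messages per interval and then gains 54 in one interval would be flagged, while steady slow growth would not.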
Reprocessing
Fix the underlying bug first. Then replay the DLQ back to the original topic for reprocessing. Tooling: usually a custom script or a small dedicated job that reads the DLQ and republishes each message. Critical: consumers must be idempotent, so reprocessing doesn't apply the same effect twice.
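A minimal sketch of replay combined with an idempotent consumer, using in-memory lists in place of real topics. The IdempotentConsumer class, the replay helper, and the assumption that every message carries a unique "id" field are all invented for illustration.

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed by a unique message id."""

    def __init__(self) -> None:
        # In production this set would be a durable store (database, cache),
        # ideally updated in the same transaction as the side effect.
        self.processed_ids: set = set()
        self.balance = 0

    def handle(self, message: dict) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message["id"] in self.processed_ids:
            return False
        self.balance += message["amount"]
        self.processed_ids.add(message["id"])
        return True


def replay(dlq: list, original_topic: list) -> int:
    """Move every DLQ record back onto the original topic; return how many moved."""
    moved = 0
    while dlq:
        record = dlq.pop(0)  # oldest first, to preserve the original order
        original_topic.append(record["message"])
        moved += 1
    return moved
```

If a message had already been applied before it failed later in the pipeline, replaying it leaves the balance unchanged, which is the property that makes replay safe to run more than once.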
Anti-patterns
Silent message loss (no DLQ). Single shared DLQ across topics (loses traceability). Reprocessing without idempotency (duplicate side effects, corrupted state). Discarding DLQ messages without analysis (you never learn the bug pattern behind them).