A good postmortem turns one incident into systemic improvement. A bad postmortem is theater with PowerPoint. The line between them is structure, blamelessness, and ruthless follow-through on action items.

Blameless means focusing on structural causes

'Person X deployed a bug' is the wrong frame. 'Our deploy process allowed an unreviewed change to reach prod' is the right one. The facts are the same, but only the structural framing leads to fixes that prevent recurrence.

Standard sections

Summary (one paragraph). Timeline (UTC timestamps for every state change). Impact (users, duration, revenue if relevant). Root cause (the technical one). Contributing factors (the systemic ones). Action items (with owners and dates). Lessons learned.
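
One way to keep drafts honest about these sections is a small completeness check run before review. The following is a minimal sketch, assuming the draft is available as a mapping from section name to text; the function name `missing_sections` and the sample draft are hypothetical, not part of any particular tool.

```python
# Section names taken from the list above, in template order.
REQUIRED_SECTIONS = [
    "Summary",
    "Timeline",
    "Impact",
    "Root cause",
    "Contributing factors",
    "Action items",
    "Lessons learned",
]


def missing_sections(doc: dict[str, str]) -> list[str]:
    """Return required sections that are absent or blank, in template order."""
    return [name for name in REQUIRED_SECTIONS if not doc.get(name, "").strip()]


# Hypothetical draft: the timeline heading exists but has not been filled in.
draft = {
    "Summary": "Checkout errors spiked after a deploy.",
    "Timeline": "",
    "Impact": "Elevated checkout error rate.",
}
gaps = missing_sections(draft)
```

A blank section counts as missing on purpose: a heading with nothing under it is the easiest way for a draft to look finished when it is not.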

Get the timeline right

Pull from logs, Slack, and paging history. Use UTC throughout; mixed time zones make the sequence hard to follow. 'Alert fired' is a timeline entry. 'Engineer noticed alert' is a separate entry; the gap between them is your detection delay, and it matters. Be specific: '14:23: checkout error rate crossed 5%', not 'around 2pm errors started'.
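
The normalization step can be sketched in a few lines. This is an illustrative example, assuming each source yields ISO 8601 timestamps with an explicit UTC offset; the helper names (`entry`, `build_timeline`, `gap`, `render`), the date, and the sample events are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime  # timezone-aware, always stored in UTC
    event: str


def entry(iso_timestamp: str, event: str) -> TimelineEntry:
    """Parse a timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # A naive timestamp is ambiguous; refuse it rather than guess a zone.
        raise ValueError(f"timestamp has no UTC offset: {iso_timestamp!r}")
    return TimelineEntry(parsed.astimezone(timezone.utc), event)


def build_timeline(entries: list[TimelineEntry]) -> list[TimelineEntry]:
    """Order entries chronologically."""
    return sorted(entries, key=lambda e: e.at)


def gap(timeline: list[TimelineEntry], first: str, second: str) -> timedelta:
    """Time elapsed between two named events."""
    times = {e.event: e.at for e in timeline}
    return times[second] - times[first]


def render(timeline: list[TimelineEntry]) -> list[str]:
    """Format entries as 'HH:MM UTC: event' lines."""
    return [f"{e.at:%H:%M} UTC: {e.event}" for e in timeline]


# Hypothetical entries gathered from sources that log in different zones.
timeline = build_timeline([
    entry("2024-01-01T09:41:00-05:00", "engineer noticed alert"),
    entry("2024-01-01T14:23:00+00:00", "checkout error rate crossed 5%"),
    entry("2024-01-01T15:25:00+01:00", "alert fired"),
])
detection_delay = gap(timeline, "alert fired", "engineer noticed alert")
```

Rejecting timestamps without an offset is a deliberate choice in this sketch: guessing a zone for a naive timestamp is how a timeline ends up silently wrong.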

Action items are commitments

Every action item gets an owner, a due date, and a tracker link. By default, review status weekly until done. Action items that slip more than 30 days are renegotiated or killed, because open lists rot.
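
The weekly review can be backed by a simple triage over the open items. The sketch below assumes action items are exported from the tracker as records with the three required fields; the `ActionItem` class, the bucket names, and the example.com links are hypothetical. Only the 30-day threshold comes from the rule above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

SLIP_LIMIT_DAYS = 30  # past this, renegotiate or kill the item


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    tracker_url: str
    done_on: Optional[date] = None  # None while the item is still open


def days_overdue(item: ActionItem, today: date) -> int:
    """Days past the due date for an open item; 0 if done or not yet due."""
    if item.done_on is not None:
        return 0
    return max(0, (today - item.due).days)


def triage(items: list[ActionItem], today: date) -> dict[str, list[str]]:
    """Bucket open items by how far they have slipped."""
    buckets: dict[str, list[str]] = {
        "on_track": [],
        "overdue": [],
        "renegotiate_or_kill": [],
    }
    for item in items:
        if item.done_on is not None:
            continue  # finished items drop out of the weekly review
        late = days_overdue(item, today)
        if late > SLIP_LIMIT_DAYS:
            buckets["renegotiate_or_kill"].append(item.title)
        elif late > 0:
            buckets["overdue"].append(item.title)
        else:
            buckets["on_track"].append(item.title)
    return buckets


# Hypothetical items for one weekly review.
today = date(2024, 3, 1)
items = [
    ActionItem("Require review on deploy branch", "alice",
               today - timedelta(days=45), "https://tracker.example.com/1"),
    ActionItem("Page on checkout error rate", "bob",
               today - timedelta(days=10), "https://tracker.example.com/2"),
    ActionItem("Add canary stage", "carol",
               today + timedelta(days=9), "https://tracker.example.com/3"),
    ActionItem("Document rollback steps", "dave",
               today - timedelta(days=60), "https://tracker.example.com/4",
               done_on=today - timedelta(days=58)),
]
review = triage(items, today)
```

The point of the third bucket is to force a decision: an item that has slipped past the limit either gets a new owner and date or gets closed explicitly.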

Anti-patterns

Postmortem doc as the deliverable (it's the process that matters). Action items like 'be more careful' (not actionable). Discussion of who pushed what button (blame creep). No tracking of follow-through (most common failure).

Structural framing, standard sections, UTC timeline, owned action items, weekly follow-through. The doc isn't the deliverable; the fixes are.