A good postmortem turns one incident into systemic improvement. A bad postmortem is theater with PowerPoint. The line between them is structure, blamelessness, and ruthless follow-through on action items.
Blameless means structural-cause focused
'Person X deployed a bug' is the wrong frame. 'Our deploy process allowed an unreviewed change to reach prod' is the right one. Same fact, different framing — and only the structural framing leads to fixes that prevent recurrence.
Standard sections
Summary (one paragraph). Timeline (UTC timestamps for every state change). Impact (users, duration, revenue if relevant). Root cause (the technical one). Contributing factors (the systemic ones). Action items (with owners and dates). Lessons learned.
Get the timeline right
Pull from logs, Slack, page history. UTC throughout (mixing timezones loses people). 'Alert fired' is a timeline entry. 'Engineer noticed alert' is a separate entry; the gap matters. Be specific: '14:23 — checkout error rate crossed 5%' not 'around 2pm errors started'.
Action items are commitments
Every action item: owner, due date, tracker link. Default: review status weekly until done. Action items that slip > 30 days are renegotiated or killed — open lists rot.
Anti-patterns
Postmortem doc as the deliverable (it's the process that matters). Action items like 'be more careful' (not actionable). Discussion of who pushed what button (blame creep). No tracking of follow-through (most common failure).