2025-05-12

The Art of the Runbook: Documentation That Saves 3AM Pages

SRE, On-Call, Documentation, Incident Response

A runbook that nobody reads during an incident is worse than no runbook. It creates false confidence. Here is how to write runbooks that on-call engineers actually use at 3 AM.

The Template

Every runbook should follow the same structure:

What is this service? One sentence. Not a paragraph, not a diagram. One sentence.
What does this alert mean? Translate the alert into human language.
First check (30 seconds): The one command that tells you whether this is real or a false alarm. Usually kubectl get pods or a curl against the health check endpoint.
Likely causes (ranked by frequency): The top reasons this alert fires, taken from actual past incidents, not from theory.
Fix for each cause: Step-by-step commands. Copy-pasteable. No "use your judgment" instructions.
Escalation: When to wake someone up, and who.
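
One way to keep every runbook on this structure is to treat the template as data and lint it. The sketch below is a minimal illustration, not a prescribed tool: the class names, field names, the one-sentence heuristic, and the example service (checkout-api, its namespace, the incident counts) are all made up for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Cause:
    description: str      # what went wrong, in plain language
    past_incidents: int   # how many past incidents had this cause
    fix_steps: list[str]  # copy-pasteable commands, in order


@dataclass
class Runbook:
    service_summary: str  # one sentence
    alert_meaning: str    # the alert in human language
    first_check: str      # the single 30-second command
    causes: list[Cause] = field(default_factory=list)
    escalation: str = ""  # when to wake someone up, and who


def ranked_causes(rb: Runbook) -> list[Cause]:
    """Order causes by how often they actually happened, most frequent first."""
    return sorted(rb.causes, key=lambda c: c.past_incidents, reverse=True)


def lint(rb: Runbook) -> list[str]:
    """Return a list of template violations; an empty list means it passes."""
    problems = []
    for name in ("service_summary", "alert_meaning", "first_check", "escalation"):
        if not getattr(rb, name).strip():
            problems.append(f"missing section: {name}")
    # Crude heuristic: more than one sentence terminator means more than one sentence.
    if len(re.findall(r"[.!?](?:\s|$)", rb.service_summary.strip())) > 1:
        problems.append("service_summary should be one sentence")
    if not rb.causes:
        problems.append("missing section: causes")
    for cause in rb.causes:
        if not cause.fix_steps:
            problems.append(f"no fix steps for cause: {cause.description}")
    return problems


# Hypothetical runbook; the escalation section is left out on purpose.
example = Runbook(
    service_summary="Checkout API takes card payments for the web store.",
    alert_meaning="More than 5% of checkout requests are failing.",
    first_check="kubectl get pods -n checkout",
    causes=[
        Cause("Database connection pool exhausted", 2,
              ["kubectl rollout restart deployment/checkout-api -n checkout"]),
        Cause("Bad deploy", 7,
              ["kubectl rollout undo deployment/checkout-api -n checkout"]),
    ],
)

print(lint(example))
print([c.description for c in ranked_causes(example)])
```

Running a check like this in CI would catch a runbook that ships without an escalation path before anyone needs it at 3 AM.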

The Anti-Patterns

The novel. A long runbook that explains the entire service architecture. Nobody reads this at 3 AM. Keep it under two pages.

The optimist. "This alert rarely fires." It fires at 3 AM on a Friday. If it never fired, there would be no runbook.

The perfectionist. "Investigate the root cause before taking action." At 3 AM, the goal is mitigation, not root cause. Restart the service, restore the backup, scale the pods. Investigate tomorrow.
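
These anti-patterns are regular enough that a rough text check can flag them. The sketch below assumes the runbook is available as plain text; the phrase list is lifted from the examples above, and the 1000-word limit is an arbitrary stand-in for "two pages", not a measured figure.

```python
# Phrases taken from the anti-patterns above; a team would extend this list.
RED_FLAGS = {
    "rarely fires": "the optimist",
    "use your judgment": "vague fix step",
    "investigate the root cause": "the perfectionist",
}
MAX_WORDS = 1000  # assumption: a rough proxy for "under two pages"


def find_anti_patterns(text: str) -> list[str]:
    """Return one finding per anti-pattern detected in the runbook text."""
    findings = []
    if len(text.split()) > MAX_WORDS:
        findings.append("the novel: longer than the word limit")
    lowered = text.lower()
    for phrase, label in RED_FLAGS.items():
        if phrase in lowered:
            findings.append(f"{label}: contains '{phrase}'")
    return findings


sample = "This alert rarely fires. Investigate the root cause before taking action."
for finding in find_anti_patterns(sample):
    print(finding)
```

A phrase match is only a prompt for a human to reread the section; it cannot tell a bad instruction from a sentence that quotes one.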

The Maintenance

Runbooks rot faster than code. After every incident, the on-call engineer should update the runbook with what actually happened and what actually fixed it. This is part of the incident retrospective: not optional, not "nice to have." If the runbook was wrong, fixing it is the most valuable output of the retrospective.
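
A simple way to enforce this is to compare each runbook's last edit date with the date of the last incident that used it. The sketch below assumes both dates are already available as dictionaries keyed by runbook name; where they come from (version control, an incident tracker) is left open, and the names and dates are invented for the example.

```python
from datetime import date


def stale_runbooks(last_updated: dict[str, date],
                   last_incident: dict[str, date]) -> list[str]:
    """Return runbooks that were not updated on or after their last incident."""
    stale = []
    for name, incident_day in last_incident.items():
        updated = last_updated.get(name)
        # A runbook with no recorded update counts as stale.
        if updated is None or updated < incident_day:
            stale.append(name)
    return sorted(stale)


# Hypothetical data for illustration only.
updates = {
    "checkout-api-5xx": date(2025, 3, 1),
    "queue-backlog": date(2025, 5, 2),
}
incidents = {
    "checkout-api-5xx": date(2025, 4, 18),
    "queue-backlog": date(2025, 5, 2),
    "disk-full": date(2025, 1, 9),
}

print(stale_runbooks(updates, incidents))
```

An update on the same day as the incident counts as fresh here; a stricter team could require the edit to land after the retrospective.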