A runbook that nobody reads during an incident is worse than no runbook. It creates false confidence. Here is how to write runbooks that on-call engineers actually use at 3 AM.
The Template
Every runbook should follow the same structure:
- What is this service? One sentence. Not a paragraph, not a diagram. One sentence.
- What does this alert mean? Translate the alert into human language.
- First check (30 seconds): The one command that tells you if this is real or a false alarm. Usually kubectl get pods or a health check curl.
- Likely causes (ranked by frequency): The top reasons this alert fires, based on historical data. Not theoretical but based on actual past incidents.
- Fix for each cause: Step-by-step commands. Copy-pasteable. No "use your judgment" instructions.
- Escalation: When to wake someone up, and who.
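One way to keep every runbook on this template is to store it as structured data and lint it before it is published. The sketch below is only an illustration of that idea: the Runbook and Cause classes, the lint function, the checkout service, and the kubectl targets are all hypothetical names invented for the example, and the one-sentence check is deliberately crude.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Cause:
    description: str
    past_incidents: int      # times this cause was confirmed in past incidents
    fix_steps: list[str]     # copy-pasteable commands, in order


@dataclass
class Runbook:
    service: str             # what the service is, in one sentence
    alert_meaning: str       # the alert in human language
    first_check: str         # the single 30-second command
    causes: list[Cause] = field(default_factory=list)
    escalation: str = ""     # when to wake someone up, and who


def lint(rb: Runbook) -> list[str]:
    """Return template violations; an empty list means the runbook passes."""
    problems = []
    # Crude sentence count: split where ., ! or ? is followed by whitespace.
    sentences = [s for s in re.split(r"[.!?]\s+", rb.service.strip()) if s]
    if len(sentences) != 1:
        problems.append("service description must be exactly one sentence")
    if not rb.alert_meaning.strip():
        problems.append("alert meaning is missing")
    if not rb.first_check.strip() or "\n" in rb.first_check.strip():
        problems.append("first check must be exactly one command")
    if not rb.causes:
        problems.append("no likely causes listed")
    counts = [c.past_incidents for c in rb.causes]
    if counts != sorted(counts, reverse=True):
        problems.append("causes are not ranked by frequency")
    for c in rb.causes:
        if not c.fix_steps:
            problems.append(f"no fix steps for cause: {c.description}")
    if not rb.escalation.strip():
        problems.append("escalation is missing")
    return problems


# Hypothetical runbook for an invented "checkout" service.
example = Runbook(
    service="Checkout API that takes card payments for the web store.",
    alert_meaning="More than 5% of checkout requests are failing.",
    first_check="kubectl get pods -n checkout",
    causes=[
        Cause("Pods crash-looping after a deploy", 7,
              ["kubectl rollout undo deployment/checkout -n checkout"]),
        Cause("Too few replicas for traffic", 3,
              ["kubectl scale deployment/checkout --replicas=6 -n checkout"]),
    ],
    escalation="Page the payments secondary if errors persist 15 minutes after mitigation.",
)
print(lint(example))  # prints []
```

A check like this could run in CI on the runbook repository, so a runbook with a missing escalation path or an unranked cause list is caught at review time rather than at 3 AM.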
The Anti-Patterns
The novel. A long runbook that explains the entire service architecture. Nobody reads this at 3 AM. Keep it under two pages.
The optimist. "This alert rarely fires." It fires at 3 AM on a Friday. If it never fired, nobody would have written a runbook for it.
The perfectionist. "Investigate the root cause before taking action." At 3 AM, the goal is mitigation, not root cause. Restart the service, restore the backup, scale the pods. Investigate tomorrow.
The Maintenance
Runbooks rot faster than code. After every incident, the on-call engineer should update the runbook with what actually happened and what actually fixed it. This is part of the incident retrospective: not optional, not "nice to have." If the runbook was wrong, fixing it is the most valuable output of the retrospective.
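The update rule can be made checkable. The sketch below assumes you can export two things from your own tooling: the date each runbook was last edited and the date of each incident per alert. The function name stale_runbooks, the alert names, and the dates are hypothetical, made up for the example.

```python
from datetime import date


def stale_runbooks(
    last_updated: dict[str, date],
    incidents: list[tuple[str, date]],
) -> list[str]:
    """Return alerts whose runbook is older than their latest incident.

    An alert that has incidents but no runbook at all is also returned.
    A runbook edited on the same day as the incident counts as up to date.
    """
    latest_incident: dict[str, date] = {}
    for alert, day in incidents:
        if alert not in latest_incident or day > latest_incident[alert]:
            latest_incident[alert] = day
    return sorted(
        alert
        for alert, day in latest_incident.items()
        if alert not in last_updated or last_updated[alert] < day
    )


# Hypothetical data: alert name -> last runbook edit, and (alert, incident date).
last_updated = {
    "checkout-5xx": date(2024, 3, 1),
    "queue-lag": date(2024, 5, 20),
}
incidents = [
    ("checkout-5xx", date(2024, 4, 12)),
    ("queue-lag", date(2024, 5, 20)),
    ("disk-full", date(2024, 2, 2)),
]
print(stale_runbooks(last_updated, incidents))  # prints ['checkout-5xx', 'disk-full']
```

Reviewing this list at the start of each retrospective turns "update the runbook" from a good intention into an agenda item.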