Writing runbooks that actually get used

The runbook was written by the engineer who understood the incident best, at 11pm the night after the postmortem, when the causal chain was still fresh. Six months later, a different engineer gets paged for the same pattern at 3am. They open the runbook. It says: "Check database connection pool metrics and verify config values are as expected." They close the runbook. It wasn't useful.

Bad runbooks don't fail because the author didn't know the answer. They fail because the author wrote for their future self — someone with full context about the system's architecture, current state, and recent history. The engineer reading the runbook at 3am is not that person. They're someone with partial context, incomplete sleep, and a pager that fired 4 minutes ago.

Why runbooks get ignored

We surveyed on-call engineers about how they use runbooks. The answers were consistent: engineers stop reaching for a runbook after the first two or three times it fails to help. Once they've learned that a particular runbook contains abstract guidance rather than specific commands, they stop opening it. The "I'll just figure it out myself" response isn't laziness. It's learned behavior, the result of repeatedly finding that the runbook costs more time than it saves.

The most common failure modes in unusable runbooks: they describe the system rather than the problem, they give diagnostic guidance without giving remediation steps, they assume familiarity with tooling that the reader may not have, they reference external documents that are themselves outdated, and they conflate the specific alert condition that triggers the runbook with unrelated scenarios that happen to look similar.

A runbook that says "high error rate on the payment service" and then gives a five-paragraph description of how the payment service architecture works is not a runbook. It's documentation that happens to live in the runbook folder. Architecture documentation belongs in Confluence. Runbooks belong next to alerts, and they should start from where the alert leaves off.

The 3am cognitive model

Writing a useful runbook requires modeling the state of the person who will read it. At 3am, after being paged by a PagerDuty alert, an on-call engineer has: a phone screen showing the alert title and a link, probably 60% of normal cognitive capacity, a strong preference for concrete actions over analysis, no patience for narrative background, and a desire to either fix the problem quickly or correctly escalate it without feeling like they failed.

They need answers to three questions in order: (1) Is this the right runbook for this alert? (2) What do I do first? (3) When do I escalate? Everything else in a runbook is secondary. Elaborate background sections, architectural diagrams, and historical incident narratives should either be absent or placed at the end, after the action steps.

The structure that works: a one-sentence description of the alert condition and why it matters, then numbered action steps starting with "Step 1: run this specific command, observe this output." Not "check the metrics." Run this command, observe this specific output, interpret it as either "normal" or "abnormal" with specific thresholds defined.
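As a rough illustration of that ordering, here is a minimal sketch of a skeleton generator. The function name runbook_skeleton, the alert name, and the escalation target are all hypothetical placeholders, and the layout is one possible reading of the structure described above, not a standard format.

```shell
# Hypothetical helper: prints a plain-text runbook skeleton.
# The alert name and escalation target are placeholders supplied by the caller.
runbook_skeleton() {
  alert_name=$1
  escalation_target=$2
  cat <<EOF
Alert: ${alert_name}
What it means: <one sentence: the condition and why it matters>

Step 1: <exact command to run>
  Normal:   <specific output or threshold>
  Abnormal: <specific output or threshold>, then go to Step 2

Step 2: <exact remediation command>

Escalate to ${escalation_target} if: <explicit, time-bounded condition>

Background (optional, last): <architecture notes, past incidents>
EOF
}

runbook_skeleton "PaymentApiHighErrorRate" "database on-call"
```

The point of the skeleton is the order: identification first, actions second, escalation third, background last.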

Concrete commands beat abstract guidance every time

The single most effective change you can make to a runbook is replacing abstract guidance with specific commands. Compare:

Abstract: "Check the database connection pool utilization metrics and verify they are within expected bounds."

Specific:

kubectl exec -n payments deploy/payment-api -- \
  curl -s localhost:9090/metrics | grep db_pool_active_connections
# Expected: < 45 (out of 50 max)
# If > 48: connection pool exhaustion — proceed to Step 3

The specific version gives the engineer the exact command, the namespace, the metric name, the thresholds, and the decision branch. It takes about 30 seconds to execute instead of 5 minutes to interpret. Across an entire runbook, that gap can compound into a 15-minute resolution instead of a 45-minute one.
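If you want the runbook to make the interpretation step mechanical as well, the decision branch can be encoded in a small helper. The sketch below is an assumption-laden example: metric_value and classify_pool are hypothetical names, the metric is assumed to be a single unlabeled series with no trailing timestamp, and the "elevated" band for values between the two thresholds is our own addition, since the example only defines "expected" and "exhaustion".

```shell
# Extract the numeric value from a metric line such as
#   db_pool_active_connections 49
# Skips "# HELP" / "# TYPE" comment lines. Assumes one series whose value
# is the last whitespace-separated field (no trailing timestamp).
metric_value() {
  awk '!/^#/ { print $NF }'
}

# Map an active-connection count to a runbook decision.
# Thresholds 45 and 48 come from the example above; the "elevated" band
# between them is an assumption, not something the example defines.
classify_pool() {
  awk -v v="$1" 'BEGIN {
    n = v + 0
    if (n > 48)       print "exhausted: proceed to Step 3"
    else if (n >= 45) print "elevated: re-check before acting"
    else              print "normal"
  }'
}

# Sample line for illustration; in practice, pipe the kubectl output in.
sample_line='db_pool_active_connections 49'
classify_pool "$(printf '%s\n' "$sample_line" | metric_value)"
```

In real use, the output of the kubectl command would be piped into metric_value in place of the sample line.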

Some teams resist this level of specificity because they worry the commands will go stale as the system evolves. This is a real concern, but it's overstated. A stale command that fails with an error is more useful than abstract guidance that succeeds at wasting your engineer's time. The stale command gives a clear failure signal that the runbook needs updating. Abstract guidance just silently fails to help.

Attaching runbooks to correlated incident context

Runbooks are significantly more useful when they arrive with incident context already populated. A runbook that starts with "here is the specific pod that OOMKilled, here is the namespace, here is the config change that preceded it, here is the last time this same pattern fired" requires far less initial investigation than a runbook that starts from scratch.

This is why runbook links in the correlated incident card matter. When an alert fires and the correlated incident already contains the relevant service names, the preceding config change, and the topology context, the runbook can reference those fields directly. "Run this command against the service named in the incident title" is better than "run this command against the payment-api service" because it works even if the runbook is being consulted for a different service that matched the same alert pattern.

Runbooks that are built to receive incident context — parameterized with service name, namespace, and incident timestamp — are more reusable than runbooks hardcoded to a specific service. They're also more likely to be maintained, because they're being actively used across multiple services rather than sitting dormant for the one specific scenario they were written for.
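A parameterized step might look something like the sketch below. The function render_runbook_steps is hypothetical, as is the idea that the incident card hands over namespace, service, and start time as plain strings. The metrics port and metric name are carried over from the earlier example and would likely differ per service.

```shell
# Hypothetical renderer: fills incident fields into runbook commands.
# Arguments: namespace, service (Deployment name), incident start (RFC 3339).
# Port 9090 and the metric name are taken from the earlier example and are
# assumed to be the same for every service this runbook covers.
render_runbook_steps() {
  namespace=$1
  service=$2
  incident_time=$3
  printf 'kubectl exec -n %s deploy/%s -- curl -s localhost:9090/metrics | grep db_pool_active_connections\n' \
    "$namespace" "$service"
  printf 'kubectl logs -n %s deploy/%s --since-time=%s\n' \
    "$namespace" "$service" "$incident_time"
}

# Example values only; real values would come from the incident card.
render_runbook_steps payments payment-api 2024-01-01T03:00:00Z
```

The function only prints commands rather than running them, so the engineer can read each one before pasting it into a terminal.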

The escalation decision: make it explicit

One of the most common on-call failure modes is delayed escalation: an engineer follows a runbook's troubleshooting steps for 40 minutes before concluding they need more help, when the escalation threshold was crossed at minute 10. Clear escalation criteria prevent this. The runbook should specify: "If Step 2 doesn't resolve the issue within 5 minutes, escalate to the database on-call. If you cannot run Step 1 because of access issues, escalate immediately."

Escalation criteria work best when they're time-bounded rather than outcome-bounded. "Escalate if you can't find the cause" leaves the engineer to judge when they've looked hard enough. "Escalate if the service hasn't recovered after 10 minutes from the start of this runbook" gives a clear trigger that doesn't require the engineer to assess their own investigation quality under pressure.
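A time-bounded trigger is simple enough to check mechanically. The sketch below assumes the engineer records a start time when they open the runbook. The name should_escalate and the 10-minute limit are illustrative, and `date +%s` is assumed to be available (it is on GNU and BSD systems).

```shell
# Decide whether the time-based escalation trigger has fired.
# Arguments: runbook start (epoch seconds), current time (epoch seconds),
#            limit in minutes.
# Returns 0 (true) once the limit has been reached, 1 otherwise.
should_escalate() {
  started_at=$1
  now=$2
  limit_minutes=$3
  elapsed=$(( now - started_at ))
  [ "$elapsed" -ge $(( limit_minutes * 60 )) ]
}

runbook_started_at=$(date +%s)   # record this when you open the runbook

# Later, between steps:
if should_escalate "$runbook_started_at" "$(date +%s)" 10; then
  echo "10 minutes elapsed without recovery: escalate now"
else
  echo "within the escalation window: continue with the next step"
fi
```

The check deliberately knows nothing about the investigation itself. It compares only the clock against the limit, which is what makes it usable under pressure.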

Keeping runbooks alive

Runbooks go stale when they're not connected to the incident workflow. If accessing and using a runbook is a manual step that requires knowing the runbook exists and finding it in a documentation system, most engineers won't do it consistently. If the runbook link surfaces automatically in the correlated incident card — because it's associated with the alert pattern in your alerting system — it becomes a natural part of the workflow.

The second maintenance driver is post-incident review. After every incident where a runbook was consulted, the postmortem should include a one-sentence assessment: did the runbook help? If not, why not? This review doesn't have to be exhaustive; a pass/fail with a short note is enough. Over time, runbooks that consistently fail get updated. Comparing incidents where the runbook was consulted and resolution was quick against incidents where it was skipped or didn't help and resolution ran long gives you a useful proxy for runbook quality. If you track it, you'll know which runbooks need investment and which ones are working.
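Tracking this does not need special tooling. As a sketch, assume each postmortem appends one line of the form "runbook_name,yes" or "runbook_name,no" to a log. That format, the function name runbook_scores, and the sample rows below are all made up for illustration.

```shell
# Summarize pass/fail review notes per runbook.
# Input (stdin): one "runbook_name,helped" line per incident,
#                where helped is "yes" or "no" (hypothetical log format).
# Output: "runbook_name helped/total", sorted by runbook name.
runbook_scores() {
  awk -F, '
    { total[$1]++ }
    $2 == "yes" { helped[$1]++ }
    END {
      for (name in total)
        printf "%s %d/%d\n", name, helped[name], total[name]
    }
  ' | sort
}

# Illustrative data only, not real incident records.
runbook_scores <<'EOF'
db-pool-exhaustion,yes
db-pool-exhaustion,yes
payment-high-error-rate,no
payment-high-error-rate,yes
payment-high-error-rate,no
EOF
```

A runbook whose helped count stays low relative to its total is a candidate for a rewrite. Adding a resolution-time column to the same log would extend the tally toward the quick-versus-slow comparison described above.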