On-call burnout is a tooling problem

On-call burnout is treated as a people management problem. The solutions offered are mostly scheduling interventions: smaller rotation windows, stricter handoff protocols, more explicit off-hours boundaries, mental health days after major incidents. These are good practices. They're also largely treating the symptom rather than the cause.

The primary driver of on-call burnout isn't the number of hours spent on call. It's the quality of those hours. An engineer who gets paged twice a night and resolves both incidents in 10 minutes each, with clear context and straightforward remediation, is less burned out after a week of on-call than an engineer who gets paged four times a night, spends 45 minutes on each incident assembling context before they can even form a hypothesis, and wakes up additional people for context they shouldn't have to ask for. Same hours on call. Radically different experience.

The signal quality tax

Every noisy alert your monitoring stack fires is a withdrawal from your on-call engineers' trust account. For the first few false-positive pages, engineers respond immediately and investigate thoroughly. Then a pattern develops: the disk_usage_high alert fires on the scratch volume at 3am, and it's always a false positive because the scratch volume is expected to fill up during batch jobs. Engineers start building their own mental noise filter. They learn which alerts are probably real and which are probably noise.

This mental noise filter is a symptom of a system design failure, not a feature. Every engineer who has developed a mental noise filter is a potential silent failure in your alerting system. The alert that looks like the scratch volume false positive but is actually a different disk path filling up will be dismissed for 20 minutes before the engineer realizes it's different. That's the burnout tax paid in incident quality, not just in fatigue.
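One way to move that filter out of engineers' heads is to encode the known exception in the alert rule itself, so only the expected case is suppressed. The sketch below is illustrative: the mount paths, the 90% threshold, and the function name are assumptions, not part of any particular monitoring product.

```python
# Hypothetical set of volumes that are expected to fill up (e.g. batch scratch space).
EXPECTED_TO_FILL = {"/scratch"}


def should_page(mount: str, used_fraction: float, threshold: float = 0.9) -> bool:
    """Page on high disk usage unless the volume is explicitly expected to fill."""
    if mount in EXPECTED_TO_FILL:
        # The known exception is suppressed by the rule, not by human habit.
        return False
    return used_fraction >= threshold


# The scratch volume stays quiet; a different path with the same symptom still pages.
print(should_page("/scratch", 0.99))
print(should_page("/var/lib/db", 0.95))
```

Because the exception is written down as an explicit path, a different disk filling up produces a page that nobody has learned to ignore.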

Noisy alerting also teaches engineers to delay acknowledgement. If you monitor time-to-acknowledge across your on-call rotation and find that certain alert types have systematically longer acknowledgement times, those are the alerts your on-call team has learned to distrust. They aren't being irresponsible. They're rationally conserving energy for the alerts that historically required action. This is alert fatigue at the level of individual signals, and it's a tooling failure masquerading as an attitude problem.
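A rough way to surface those distrusted alerts is to compare each alert type's median acknowledgement time against the overall median. This is a minimal sketch: the record format, the sample numbers, and the 1.5x cutoff are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical page records: (alert_name, seconds from page to acknowledgement).
pages = [
    ("disk_usage_high", 540),
    ("disk_usage_high", 610),
    ("disk_usage_high", 480),
    ("api_error_rate", 45),
    ("api_error_rate", 70),
    ("api_error_rate", 55),
]


def median_ack_by_alert(records):
    """Group acknowledgement times by alert name and return each group's median."""
    grouped = defaultdict(list)
    for name, ack_seconds in records:
        grouped[name].append(ack_seconds)
    return {name: median(times) for name, times in grouped.items()}


def distrusted_alerts(records, factor=1.5):
    """Return alert names whose median ack time exceeds factor x the overall median."""
    overall = median(ack for _, ack in records)
    per_alert = median_ack_by_alert(records)
    return sorted(name for name, m in per_alert.items() if m > factor * overall)


print(distrusted_alerts(pages))
```

The cutoff is arbitrary. What matters is the gap between alert types, not any absolute number.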

The context assembly tax

Beyond alert noise, the second major burnout driver is context assembly overhead. When an engineer is paged at 2am and the alert gives them a service name and an error rate threshold, the first 15–30 minutes of their incident response is pure cognitive overhead: opening dashboards, querying metrics, checking deploy history, reading logs, determining which of the five possible causes this particular symptom pattern corresponds to.

This is exhausting in a way that's distinct from the fatigue of being awake at 2am. It's the exhaustion of high-stakes problem-solving under incomplete information, time pressure, and interrupted sleep. Fifteen minutes of this cognitive state is more draining than 45 minutes of straightforward remediation work. Engineering teams don't usually separate these in their incident post-mortems, but the people doing on-call know the difference.

An on-call rotation where pages routinely arrive with correlated context — "here's the service, here's the correlated infrastructure event, here's the config change that preceded it, here's a runbook link" — is a qualitatively different experience from one where pages arrive with a single metric threshold breach. Same number of pages, same total time awake. But the cognitive load per page is dramatically lower when the context is pre-assembled rather than manually reconstructed.
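As a sketch of what pre-assembled context could look like, the snippet below attaches recent changes to the same service, plus a runbook link, to an alert before it pages anyone. The data shapes, the one-hour lookback window, and the runbook path are hypothetical; a real pipeline would pull these from your deploy and configuration history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str
    name: str
    fired_at: datetime


@dataclass
class Change:
    service: str
    description: str
    applied_at: datetime


# Hypothetical mapping from alert name to runbook location.
RUNBOOKS = {"api_error_rate": "runbooks/api-error-rate"}


def enrich(alert, changes, window=timedelta(hours=1)):
    """Build a page payload that carries recent changes and a runbook link."""
    recent = [
        c.description
        for c in changes
        if c.service == alert.service
        and timedelta(0) <= alert.fired_at - c.applied_at <= window
    ]
    return {
        "service": alert.service,
        "alert": alert.name,
        "recent_changes": recent,
        "runbook": RUNBOOKS.get(alert.name),  # None when no runbook is attached
    }
```

The lookup is done once, by the tooling, instead of once per incident by whoever happens to be awake.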

Quantifying the burnout-tooling relationship

If you want to understand whether tooling is driving burnout on your team, you can measure it indirectly through several proxies. First: alert-to-incident ratio over time. A ratio above 5 (five alert firings per true incident) indicates significant noise that's consuming cognitive resources. Second: time-to-acknowledge distribution for your on-call alerts. High variance or systematically slow acknowledgement for certain alert types is a signal of distrust. Third: engineer-reported "effective incident time" — not time awake during an incident, but time spent actually knowing what to do versus time spent figuring out what's happening. Fourth: post-rotation attrition — if engineers consistently ask to come off on-call rotation after a stint, or if you have trouble filling rotation slots, that's a direct indicator that the on-call experience is not sustainable.
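The first proxy is simple arithmetic. A minimal sketch, assuming you can count alert firings and confirmed incidents over the same period (the function names and the way you obtain the counts are illustrative):

```python
def alert_to_incident_ratio(alert_firings, true_incidents):
    """Alert firings per true incident; None when there were no incidents to compare."""
    if true_incidents == 0:
        return None
    return alert_firings / true_incidents


def is_noisy(ratio, threshold=5.0):
    """Apply the rule of thumb above: a ratio over 5 suggests significant noise."""
    return ratio is not None and ratio > threshold


# Example: 180 firings against 10 real incidents in the same period.
ratio = alert_to_incident_ratio(180, 10)
print(ratio, is_noisy(ratio))
```

Tracking this per week or per rotation makes the trend visible, which is more useful than any single reading.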

We've spoken with platform teams where the alert-to-incident ratio was running at 18:1. Eighteen alert firings for every real incident. On average, each real incident cost the on-call engineer 18 pages, and many of those pages led to investigations that went nowhere. Three months after implementing correlation and alert deduplication, the same team's ratio was under 3:1. The engineers didn't change. The incidents didn't change. The signal quality changed, and the on-call experience improved measurably, both in the metrics and in engineer feedback about the rotation.
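For readers unfamiliar with deduplication, here is one simple form of it: suppress repeat firings of the same alert on the same service within a time window. This is an illustrative sketch, not the approach that team used; the tuple format and the five-minute window are assumptions.

```python
def deduplicate(alerts, window_seconds=300):
    """Collapse repeat firings of the same (service, name) within the window.

    alerts: iterable of (timestamp_seconds, service, name) tuples.
    Returns the firings that would actually page, in time order.
    """
    last_paged = {}  # (service, name) -> timestamp of the last page sent
    pages = []
    for ts, service, name in sorted(alerts):
        key = (service, name)
        if key in last_paged and ts - last_paged[key] < window_seconds:
            continue  # same alert fired again too soon: suppress it
        last_paged[key] = ts
        pages.append((ts, service, name))
    return pages


firings = [
    (0, "api", "error_rate"),
    (60, "api", "error_rate"),    # repeat within 5 minutes, suppressed
    (120, "db", "disk_usage_high"),
    (400, "api", "error_rate"),   # outside the window, pages again
]
print(len(deduplicate(firings)))
```

Correlation goes further by grouping different alerts that share a cause, but even this basic step removes the pages that carry no new information.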

What on-call engineers actually ask for

When you ask on-call engineers what would make their rotation better, they rarely ask for fewer incidents. Incidents are part of the job and most experienced engineers understand that. What they ask for, consistently, is: context delivered with the alert rather than assembled during the incident, a clear indication of what to do first (a runbook they can trust, a known pattern they recognize), a way to quickly determine whether this is within their domain or needs escalation, and confidence that the alert is real rather than noise.

All four of these are tooling problems. Context delivery is a correlation and alert enrichment problem. Trustworthy runbooks are a documentation quality and runbook-to-alert attachment problem. Escalation clarity is an ownership model and incident routing problem. Noise is a signal quality and alert deduplication problem. None of them is solved by changing the on-call schedule, hiring more engineers, or adding wellness programs, though those things matter too.

The sustainability calculus

On-call rotations are sustainable when the experience of being on call is bounded and manageable. Bounded means: pages don't come at a rate that prevents recovery between incidents. Manageable means: when a page comes, the engineer has what they need to handle it effectively. The rotation size and schedule affect "bounded." The tooling quality affects "manageable."

Most organizations have invested heavily in the bounded side: calculating rotation sizes, setting escalation paths, defining incident severity thresholds. Many have underinvested in the manageable side: the quality of context delivered per alert, the fidelity of runbooks, the reliability of the alert signal. The returns on improving signal quality, runbook fidelity, and context delivery are high precisely because they improve every page in the rotation, not just the schedule around those pages.

Burnout prevention that only addresses rotation structure without addressing signal quality is addressing the wrong constraint for most teams. Fix the tooling first. The tooling is why on-call is exhausting; the schedule is just the frame around the experience.