Mean time to WTF: a better MTTR metric

Figure: Incident timeline bar chart segmenting detection and resolution phases

MTTR — Mean Time to Resolve — is the metric most engineering organizations track to measure incident response effectiveness. It's also, in many ways, the wrong metric. Not because resolution time doesn't matter — it clearly does — but because MTTR is a single number that aggregates very different kinds of time, hiding the phase where most of the pain actually lives.

From the moment an anomaly occurs to the moment the incident is declared resolved, time passes through three distinct phases: detection and notification, context assembly and diagnosis, and remediation. MTTR bundles all three. When your MTTR improves, you don't know which phase got faster. When it gets worse, you don't know where to invest. The aggregate hides the structure.
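To make the point concrete, here is a minimal sketch with made-up numbers. The two quarters and every duration in them are hypothetical, chosen only so that the aggregate comes out identical while the phase mix differs.

```python
from statistics import mean

# Hypothetical per-incident phase durations in minutes:
# (detection and notification, context assembly and diagnosis, remediation)
quarter_a = [(3, 30, 12), (4, 35, 6), (2, 28, 15)]
quarter_b = [(3, 12, 30), (4, 10, 31), (2, 14, 29)]


def mttr(incidents):
    """Mean of total incident duration, which is all that MTTR reports."""
    return mean(sum(phases) for phases in incidents)


def phase_means(incidents):
    """Mean duration of each phase, taken column by column."""
    return tuple(mean(column) for column in zip(*incidents))


print(mttr(quarter_a), mttr(quarter_b))  # same aggregate for both quarters
print(phase_means(quarter_a))            # diagnosis-heavy quarter
print(phase_means(quarter_b))            # remediation-heavy quarter
```

Both quarters report the same MTTR, yet one is dominated by diagnosis and the other by remediation, and they call for different investments.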

Breaking down the incident timeline

Here is what a typical significant incident looks like, phase by phase.

Detection and notification starts when the anomaly occurs and ends when an engineer has been paged and has the alert in front of them. In organizations with well-tuned alerting, this is 2–5 minutes. In organizations with alert fatigue and noisy queues, engineers delay acknowledging pages or, worse, silence repeated alerts from sources they've learned to distrust. Detection time can stretch to 20+ minutes if alert quality is poor.

Context assembly and diagnosis, the phase we measure as Mean Time to WTF, starts when the engineer opens the alert and ends when they have a working hypothesis about root cause. This is where most incident time is spent. It's the phase where the engineer is jumping between dashboards, checking logs, querying metrics, looking up recent deploys, and messaging teammates who might know the system. In well-instrumented, well-organized environments, it takes 10–20 minutes. In under-instrumented environments with fragmented tooling, it takes 45–90 minutes. The variance is enormous.

Remediation starts once the engineer has a hypothesis and ends when the service is restored. For simple fixes (restart a pod, roll back a deploy), this can be 2–5 minutes. For complex root causes (data migration rollback, network configuration issue, database corruption), it can be hours. The remediation time is largely bounded by the nature of the fix, not the tooling — there's less to optimize here compared to the context assembly phase.

Why the WTF phase dominates your MTTR

In the incident timelines we've analyzed, context assembly and diagnosis typically represents 50–70% of total MTTR for significant incidents. It's the dominant phase, and it's also the most improvable — not by hiring better engineers, but by changing the quality of information they have when the incident starts.

Consider two engineers handling the same incident. Engineer A gets paged and sees: "p99 latency spiked on payment-api." They open their metrics dashboard, check the service graph, look at recent traffic patterns, check the pod events, look at the deploy history. After 35 minutes, they find that a Helm chart update from 40 minutes ago changed the database connection pool configuration. They roll it back. Resolution in 45 minutes.

Engineer B gets paged and sees: "p99 latency spiked on payment-api. Correlated: pod restart (OOMKill) at 23:41 UTC, Helm chart upgrade payment-api v3.1.1→v3.1.2 at 23:38 UTC, changes include connection pool max-size 50→20." Engineer B opens the Helm history, confirms the change, runs a rollback. Resolution in 8 minutes.

Same root cause. Same engineer quality. Same remediation action. Radically different WTF time. The difference is the quality of context delivered at alert time.
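One simple way to produce an alert like Engineer B's is to attach every change event that landed shortly before the alert fired. The sketch below is only that: plain time-window matching, not a real correlation engine. The event fields (`at`, `summary`), the 30-minute lookback, the alert time of 23:43 UTC, the calendar date, and the unrelated `search-api` event are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def recent_changes(alert_time, change_events, lookback=timedelta(minutes=30)):
    """Return change events inside the lookback window before the alert, newest first."""
    window_start = alert_time - lookback
    hits = [e for e in change_events if window_start <= e["at"] <= alert_time]
    return sorted(hits, key=lambda e: e["at"], reverse=True)


def enrich(alert_text, alert_time, change_events):
    """Append any recent change events to the alert text."""
    hits = recent_changes(alert_time, change_events)
    if not hits:
        return alert_text
    details = ", ".join(f'{e["summary"]} at {e["at"]:%H:%M} UTC' for e in hits)
    return f"{alert_text} Correlated: {details}"


def utc(hour, minute):
    # Hypothetical date; the example above only gives times of day.
    return datetime(2024, 3, 1, hour, minute, tzinfo=timezone.utc)


changes = [
    {"at": utc(21, 10), "summary": "config reload on search-api"},  # outside the window
    {"at": utc(23, 38), "summary": "Helm chart upgrade payment-api v3.1.1→v3.1.2"},
    {"at": utc(23, 41), "summary": "pod restart (OOMKill)"},
]

enriched = enrich("p99 latency spiked on payment-api.", utc(23, 43), changes)
print(enriched)
```

A production version would also need to scope events to the affected service and its dependencies. A pure time window will pull in unrelated changes on a busy platform.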

Measuring Mean Time to WTF

To measure Mean Time to WTF (MTTWTF) in practice, you need to segment your incident timeline at phase boundaries. Most incident management systems (PagerDuty, Opsgenie, and similar) capture alert time and acknowledgement time. They don't natively capture "time at which the engineer had a working root cause hypothesis."

A practical proxy: instrument the point in your incident workflow where the engineer first updates the incident with a hypothesis or a specific action taken. The time between alert acknowledgement and the first substantive incident update is a reasonable MTTWTF proxy. It's not perfect — an engineer might form a correct hypothesis mentally before logging it — but over a population of incidents it correlates well with actual context assembly time.
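As a rough sketch of that proxy, assume you can export each incident with an acknowledgement timestamp and the timestamp of its first substantive update. The field names and the three sample incidents below are hypothetical, not any vendor's schema, and deciding which update counts as "substantive" is left to your own workflow.

```python
from datetime import datetime
from statistics import mean, median


def wtf_minutes(acknowledged_at, first_update_at):
    """Minutes between acknowledgement and first substantive update (ISO 8601 strings)."""
    ack = datetime.fromisoformat(acknowledged_at)
    update = datetime.fromisoformat(first_update_at)
    return (update - ack).total_seconds() / 60


# Hypothetical export rows for illustration only.
incidents = [
    {"id": "INC-101", "acknowledged_at": "2024-03-01T23:43:00+00:00",
     "first_update_at": "2024-03-01T23:55:00+00:00"},
    {"id": "INC-102", "acknowledged_at": "2024-03-04T10:05:00+00:00",
     "first_update_at": "2024-03-04T10:35:00+00:00"},
    {"id": "INC-103", "acknowledged_at": "2024-03-09T14:00:00+00:00",
     "first_update_at": "2024-03-09T14:48:00+00:00"},
]

samples = [wtf_minutes(i["acknowledged_at"], i["first_update_at"]) for i in incidents]
mttwtf = mean(samples)
print(f"MTTWTF proxy: {mttwtf:.1f} min (median {median(samples):.1f} min)")
```

Reporting the median next to the mean is a design choice worth considering, since one long novel incident can pull the mean well away from the typical case.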

An alternative approach is to measure it through postmortem structured data. If your postmortem template includes a "time of first correct hypothesis" field, you can track this over time. This requires postmortem discipline but produces cleaner measurements. Some teams use a "timeline anchors" format in postmortems: detected at T+0, paged at T+2, root cause identified at T+28, remediation started at T+31, resolved at T+38. Each anchor is filled in from memory or from log data after the fact.
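The anchors in that example translate directly into phase durations. The sketch below uses the T+ values quoted above. The class and field names are our own, and one boundary choice is an assumption: we count the gap between identifying the root cause and starting the fix as part of remediation, and we treat "paged" as the moment diagnosis begins.

```python
from dataclasses import dataclass


@dataclass
class TimelineAnchors:
    """Minutes since detection (T+0) for each postmortem anchor."""
    detected: float
    paged: float
    root_cause_identified: float
    remediation_started: float
    resolved: float


def phase_durations(a):
    """Split one incident into the three phases used in this article."""
    return {
        "detection_and_notification": a.paged - a.detected,
        "context_assembly_and_diagnosis": a.root_cause_identified - a.paged,
        # Assumption: hypothesis-to-fix-start time is counted as remediation.
        "remediation": a.resolved - a.root_cause_identified,
    }


example = TimelineAnchors(detected=0, paged=2, root_cause_identified=28,
                          remediation_started=31, resolved=38)
durations = phase_durations(example)
wtf_share = durations["context_assembly_and_diagnosis"] / (example.resolved - example.detected)

print(durations)
print(f"WTF share of total: {wtf_share:.0%}")
```

For this incident the diagnosis phase is 26 of 38 minutes, or about 68% of the total.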

What a good MTTWTF target looks like

We're hesitant to give a universal MTTWTF target because it varies significantly by system complexity and incident severity. A P1 incident affecting all users in a production system warrants more aggressive MTTWTF targeting than a P3 degradation affecting a small percentage of users. Some rough reference points from teams that have invested in correlation and context quality:

  • P1 incidents with good correlation tooling and a correlated preceding change event: MTTWTF under 15 minutes
  • P1 incidents without a correlated change event (novel failure modes): MTTWTF 25–40 minutes is realistic, not a sign of poor performance
  • P2/P3 incidents with clear preceding context: MTTWTF under 10 minutes is achievable
  • Recurring incident patterns (same root cause seen before): MTTWTF under 5 minutes with good runbook coverage

The distinction between incidents with and without a correlated change event is important. When an incident is novel, meaning a new failure mode your system hasn't seen before, context assembly necessarily takes longer. No amount of tooling can replace the investigation work required for genuinely novel failures. The tooling improvement pays off most clearly for the majority of incidents that do have a correlated change or a recurring pattern.
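If you want to compare incidents against these reference points automatically, a lookup along the following lines is one option. The function name and parameters are hypothetical. It uses the upper end of each range from the list above, and it returns `None` for combinations the list doesn't cover rather than inventing a number.

```python
def mttwtf_target_minutes(severity, has_correlated_change, is_recurring):
    """Upper-bound MTTWTF target in minutes, based on the rough reference points above."""
    if is_recurring:
        return 5  # assumes good runbook coverage
    if severity == "P1":
        # 40 is the top of the "realistic" 25-40 minute range for novel P1 failures.
        return 15 if has_correlated_change else 40
    if severity in ("P2", "P3") and has_correlated_change:
        return 10
    return None  # no reference point given for this combination


def within_target(observed_minutes, severity, has_correlated_change, is_recurring):
    """True or False against the target, or None when no target applies."""
    target = mttwtf_target_minutes(severity, has_correlated_change, is_recurring)
    return None if target is None else observed_minutes <= target


print(within_target(28, "P1", has_correlated_change=False, is_recurring=False))
print(within_target(28, "P1", has_correlated_change=True, is_recurring=False))
```

The same 28-minute diagnosis is within expectations for a novel P1 failure and over target for one with a correlated change, which is why the two cases should be tracked separately.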

Using MTTWTF to drive the right investments

If you track MTTWTF as a separate metric from overall MTTR, it tells you where to invest. High MTTWTF with low remediation time means your tooling is costing you investigation time — invest in better context delivery, alert correlation, or runbook coverage. Low MTTWTF with high remediation time means your diagnosis is fast but your ability to remediate is slow — invest in deployment automation, rollback tooling, or operational runbook procedures for the specific incident types that take longest to fix.
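That decision rule is simple enough to write down. In the sketch below the two thresholds are placeholders we picked for illustration, not recommended values, and the "both slow" and "neither slow" branches go beyond the two cases described above.

```python
def investment_hint(mttwtf_min, remediation_min,
                    wtf_threshold_min=20.0, remediation_threshold_min=15.0):
    """Map phase-level averages to the kind of investment suggested above.

    Both thresholds are hypothetical defaults; tune them to your own baseline.
    """
    slow_diagnosis = mttwtf_min > wtf_threshold_min
    slow_remediation = remediation_min > remediation_threshold_min

    if slow_diagnosis and not slow_remediation:
        return "context delivery, alert correlation, runbook coverage"
    if slow_remediation and not slow_diagnosis:
        return "deployment automation, rollback tooling, remediation runbooks"
    if slow_diagnosis and slow_remediation:
        return "both phases are slow; start with the larger share of MTTR"
    return "neither phase exceeds its threshold"


print(investment_hint(mttwtf_min=35, remediation_min=5))
print(investment_hint(mttwtf_min=8, remediation_min=60))
```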

We're not saying MTTR is a useless metric. Resolution time matters to users and to SLA commitments. We're saying MTTR without phase-level decomposition is a management metric, not an engineering metric. It tells you whether you're getting better or worse overall but doesn't tell you what to change. MTTWTF is where the engineering work is. That's where the 20-minute incidents become 5-minute incidents — not by making engineers faster, but by ensuring they spend their time solving the problem rather than assembling the context to understand what the problem is.