Blog

Engineering writing on observability and incident culture

From the Infrawatch team. We write about what we see in platform engineering environments — alert fatigue, OOMKill patterns, p99 tails, and the correlation layer most teams don't have yet.

Incident response

Why alert storms kill incident response

When 40 alerts fire for one root cause, your engineers aren't facing a monitoring problem. They're facing a signal quality problem.

Kubernetes

OOMKill: the silent incident multiplier

A pod dies quietly. Three services upstream see latency spikes. Your Slack explodes. Here's why OOMKills are the most undertracked incident signal in Kubernetes clusters.

Platform engineering

Building correlation, not just collection

Every team has more telemetry than they know what to do with. The gap isn't collection — it's the layer that links signals together.

Config management

Config drift: the invisible outage cause

The deployment happened 4 hours ago. The incident is happening now. Most tools won't make that connection. Here's how to track it.

Metrics

Mean time to WTF: a better MTTR metric

MTTR measures how fast you resolved an incident. But how long did it take your team to understand what was happening? That gap is where the real cost lives.

On-call

On-call burnout is a tooling problem

Engineers don't burn out because they care too much. They burn out because their tools make 3am worse than it needs to be.

Company

Infrawatch 2025 year in review

From private beta to 3 paying customers, with 7 major releases and one correlation engine rewrite. Here's what we shipped and what we learned.