Blog

Engineering writing on observability and incident culture

From the Infrawatch team. We write about what we see in platform engineering environments — alert fatigue, OOMKill patterns, p99 tails, and the correlation layer most teams don't have yet.

May 22, 2026 Incident response

Why alert storms kill incident response

When 40 alerts fire for one root cause, your engineers aren't facing a monitoring problem. They're facing a signal quality problem.

April 30, 2026 Observability

p99 latency: the metric that hides in plain sight

Your p50 looks fine. Your p95 is borderline. And 1% of your requests are taking 10 seconds or longer. Here's why that matters.

April 8, 2026 Kubernetes

OOMKill: the silent incident multiplier

A pod dies quietly. Three services upstream see latency spikes. Your Slack explodes. Here's why OOMKills are the most undertracked incident signal in Kubernetes clusters.

March 17, 2026 Platform engineering

Building correlation, not just collection

Every team has more telemetry than they know what to do with. The gap isn't collection — it's the layer that links signals together.

February 24, 2026 Config management

Config drift: the invisible outage cause

The deployment happened 4 hours ago. The incident is happening now. Most tools won't make that connection. Here's how to track it.

January 30, 2026 Platform engineering

Platform engineering teams need different tools

SRE teams optimize for individual service reliability. Platform engineering teams need to understand how 200 services fail together.

January 14, 2026 Architecture

Incident correlation at 200 microservices

The correlation problem changes fundamentally when you cross 100 services. Here's what we learned building a system to handle it.

December 9, 2025 On-call

Writing runbooks that actually get used

Most runbooks are written during calm reflection and read during panic. Here's how to close that gap.

November 11, 2025 Metrics

Mean time to WTF: a better MTTR metric

MTTR measures how fast you resolved an incident. But how long did it take your team to understand what was happening? That gap is where the real cost lives.

October 14, 2025 OpenTelemetry

OpenTelemetry: the foundation you still need to build on

OTel standardizes how you collect signals. It doesn't tell you what to do with them when three of them fire at the same time.

September 8, 2025 On-call

On-call burnout is a tooling problem

Engineers don't burn out because they care too much. They burn out because their tools make 3am worse than it needs to be.

December 31, 2025 Company

Infrawatch 2025 year in review

From private beta to 3 paying customers, with 7 major releases and 1 correlation engine rewrite. Here's what we shipped and what we learned.