Infrawatch 2025 year in review

Three years ago, Nadia and I were sitting in a Chicago coffee shop arguing about whether anyone would pay for a correlation layer on top of their existing observability stack. The conventional wisdom was: companies want fewer tools, not more. Observability stacks were already getting complex — Prometheus, Grafana, Jaeger, the alerting layer, the on-call routing layer. Another component in the pipeline felt like the wrong direction.

We built it anyway, because we'd each lived through enough 3am incidents where the data existed and nobody could find the thread fast enough. We knew the pain was real. We weren't sure enough people would pay to fix it. Three years in: they will.

What we shipped in 2025

The core correlation engine has been running since our private beta opened in May 2025. Through the second half of the year and into January 2026, we shipped three things that changed what the product can do.

Alert deduplication (v0.7.0, July 2025). Before this, Infrawatch correlated signals but didn't suppress redundant notifications. You'd get one unified incident card but still receive separate PagerDuty pages for each component alert. The deduplication engine — which runs upstream of your notification channel and collapses correlated alerts before they fire — was the feature that most consistently moved our customers' alert-to-incident ratios from double digits to under 3. This is the feature that changed on-call experiences, not just dashboards.
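As a rough illustration of collapsing correlated alerts before notification (not Infrawatch's actual implementation), the sketch below groups alerts by a shared key and emits one notification per group. The `Alert` and `Notification` types, the `correlation_key` field, and the `collapse_alerts` function are all hypothetical names introduced for this example; it assumes an upstream step has already assigned the key.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    # Hypothetical field: assumed to be assigned by an upstream correlation step.
    correlation_key: str


@dataclass
class Notification:
    correlation_key: str
    alerts: list = field(default_factory=list)


def collapse_alerts(alerts):
    """Group alerts that share a correlation key into one notification each."""
    grouped = {}
    for alert in alerts:
        if alert.correlation_key not in grouped:
            grouped[alert.correlation_key] = Notification(alert.correlation_key)
        grouped[alert.correlation_key].alerts.append(alert)
    # Dicts preserve insertion order, so notifications follow first-seen order.
    return list(grouped.values())


alerts = [
    Alert("HighLatency", "checkout", "inc-1"),
    Alert("PodRestart", "checkout", "inc-1"),
    Alert("RedisOOM", "cache", "inc-1"),
    Alert("DiskFull", "logging", "inc-2"),
]
notifications = collapse_alerts(alerts)
print(f"{len(alerts)} alerts collapsed into {len(notifications)} notifications")
```

In this toy input, four component alerts become two pages: one for the three alerts that belong to the same incident and one for the unrelated disk alert.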

Runbook attachment per incident (v0.8.0, November 2025). The insight behind this feature: runbooks are most useful when they arrive with the incident rather than requiring the on-call engineer to know where to look. You define runbook associations by incident pattern — when the correlation engine produces an incident matching a defined pattern (service name, signal types, topology), the relevant runbook link surfaces automatically in the incident card. We started simple with URL attachment; the next iteration will support inline runbook steps directly in the card.
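To make the pattern-matching idea concrete, here is a minimal sketch under stated assumptions. The `Incident` and `RunbookRule` shapes, the `find_runbook` function, and the example.com URLs are hypothetical; the real pattern definition also covers topology, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Incident:
    service: str
    signal_types: frozenset


@dataclass(frozen=True)
class RunbookRule:
    # Hypothetical pattern: a service name plus the signal types that must be present.
    service: str
    required_signals: frozenset
    url: str


def find_runbook(incident: Incident, rules: list) -> Optional[str]:
    """Return the URL of the first rule whose pattern matches the incident."""
    for rule in rules:
        same_service = rule.service == incident.service
        has_signals = rule.required_signals <= incident.signal_types
        if same_service and has_signals:
            return rule.url
    return None


rules = [
    RunbookRule("checkout", frozenset({"oomkill", "latency"}),
                "https://runbooks.example.com/checkout-redis"),
    RunbookRule("checkout", frozenset({"latency"}),
                "https://runbooks.example.com/checkout-latency"),
]
incident = Incident("checkout", frozenset({"oomkill", "latency", "pod_restart"}))
runbook_url = find_runbook(incident, rules)
print(runbook_url)
```

Ordering rules from most to least specific, as above, is one simple way to make sure the narrower runbook wins when several patterns match.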

ArgoCD change tracking (v0.8.5, January 2026). We'd had Helm and Terraform change ingestion since the beginning. ArgoCD was the missing piece for teams running full GitOps. Sync events, rollbacks, and application health transitions now flow into the config change stream and participate in change-to-incident correlation. For teams that moved to ArgoCD specifically to get better change control, this closes the loop on seeing how their ArgoCD-managed deployments correlate with infrastructure events.

What we learned from our customers

The pattern that showed up most clearly in our customer conversations in 2025: the value of Infrawatch isn't just the specific incidents it helps resolve faster. It's the change in how platform teams think about incidents. When the first notification you receive contains correlated context rather than a single metric breach, the investigation starts at a different cognitive level. You're not asking "what might be related to this?" You're asking "which of these three correlated things is the root cause?"

That's a smaller problem. It's still a hard problem in complex systems — you still need domain knowledge, and you can still go down wrong paths. But it's a tractable problem rather than an archaeological one. Teams that use Infrawatch for several months consistently report that their postmortem culture changes: the timeline sections get shorter because there's less "and then we realized we should look at deploys" and more "the correlated incident card already showed the Helm change." Postmortems become about process improvement rather than timeline reconstruction.

We also learned what doesn't work yet. Our current correlation window is configurable from 2 to 30 minutes. For incidents with longer causal chains — config changes that only produce failures under specific load conditions, migrations that degrade slowly over 6–12 hours — our current windowing doesn't surface the right change context. This is the hardest correlation problem: long-latency causality. We have some ideas about how to approach it, and it's the primary research thread for 2026.
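The windowing limitation can be shown with a small sketch. This is an assumed, simplified model of a lookback window rather than the product's real logic; `changes_in_window` and the sample change names are hypothetical, and only the 2 to 30 minute bounds come from the text above.

```python
from datetime import datetime, timedelta

# Bounds taken from the text: the window is configurable from 2 to 30 minutes.
MIN_WINDOW_MINUTES = 2
MAX_WINDOW_MINUTES = 30


def changes_in_window(incident_start, changes, window_minutes):
    """Return names of changes that landed within the lookback window."""
    if not MIN_WINDOW_MINUTES <= window_minutes <= MAX_WINDOW_MINUTES:
        raise ValueError("window_minutes must be between 2 and 30")
    earliest = incident_start - timedelta(minutes=window_minutes)
    return [name for name, applied_at in changes
            if earliest <= applied_at <= incident_start]


incident_start = datetime(2025, 11, 3, 14, 0)
changes = [
    ("helm upgrade checkout", datetime(2025, 11, 3, 13, 50)),  # 10 minutes before
    ("db migration orders", datetime(2025, 11, 3, 6, 0)),      # 8 hours before
]
found = changes_in_window(incident_start, changes, 30)
print(found)
```

Even at the maximum window, the migration applied eight hours earlier never enters the candidate set, which is the long-latency gap described above.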

Incident correlation is still the most underserved problem in observability

The observability market in 2025 continued to mature around collection and storage. Full-stack observability platforms got better at ingesting signals, storing them efficiently, and providing flexible query interfaces. The underlying tension we identified when building Infrawatch hasn't changed: collection quality is no longer the bottleneck for most teams. The bottleneck is meaning — connecting signals across sources and time windows into a coherent incident model.

The market has started moving toward this: larger platforms now ship anomaly detection and AIOps-branded correlation features. These are real investments. They're also, in our observation, primarily focused on reducing false positive alerts through better threshold modeling, which is valuable but different from the correlation problem. Threshold modeling makes individual alerts smarter. Correlation addresses what happens when multiple smart alerts all fire at once because they're measuring different facets of the same underlying event.

We remain focused on the correlation problem specifically. Not all of observability — the ingestion, the query, the dashboards. The connection between signals that already exist in your stack, across the sources they come from, in a model that's aware of your service topology and your change history. That problem is still largely unsolved by the general-purpose observability platforms, and we think it will remain a distinct enough problem to warrant a dedicated tool for some time.

The root cause heatmap: our 2026 Q1 launch

The feature we're most excited about shipping in early 2026 is the root cause heatmap (v0.9.0, in preview now for Platform customers). After an incident resolves, the heatmap shows a weighted view of which signal types and which services most reliably preceded the root cause across your last 90 days of incidents. It answers the question: "In incidents affecting our payment namespace, which signal appeared first with the highest predictive accuracy for the root cause?"

This isn't anomaly detection: we're not predicting whether an incident will happen. We're building a corpus of "what did the incident look like from the signals' perspective" that improves future correlation confidence. If the payment service's Redis OOMKill preceded the root cause in 78% of last quarter's incidents, Infrawatch should weight that signal higher in future correlation groupings. The heatmap makes this weighting transparent and adjustable.
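One way to picture the weighting is as a per-signal precedence rate over past incidents. The sketch below is an assumption about how such a rate could be computed, not the heatmap's actual scoring; `precedence_rates` and the signal names are hypothetical, and the sample data is invented for illustration.

```python
from collections import Counter


def precedence_rates(incidents):
    """Fraction of incidents in which each signal appeared before the root cause.

    `incidents` is a list of collections of signal names, one per past incident.
    """
    total = len(incidents)
    if total == 0:
        return {}
    counts = Counter()
    for preceding_signals in incidents:
        # Deduplicate so a signal counts at most once per incident.
        counts.update(set(preceding_signals))
    return {signal: count / total for signal, count in counts.items()}


# Invented sample: signals seen before the root cause in four past incidents.
history = [
    {"redis_oomkill", "latency"},
    {"redis_oomkill"},
    {"latency", "pod_restart"},
    {"redis_oomkill", "pod_restart"},
]
rates = precedence_rates(history)
ranked = sorted(rates, key=rates.get, reverse=True)
print(ranked[0], rates[ranked[0]])
```

A rate like this could then serve as a starting weight that an operator overrides, which is the kind of adjustment the heatmap is meant to expose.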

We launched private beta in May 2025 with the conviction that there was a distinct, underserved problem at the intersection of metrics, events, and config changes. We ended 2025 with enough customer evidence to know we were right about the problem, and enough shipped product to know we're building in the right direction. What remains is doing it well at the scale the problem demands — and that work is the focus for the year ahead.
