Building correlation, not just collection

If you work in platform engineering, you've probably deployed a metrics pipeline that ingests millions of data points per minute. You've set up distributed tracing. You've got Kubernetes events flowing into your log aggregator. You have dashboards. You have alerts. And you still spend the first 20 minutes of every significant incident asking "what is actually happening?" before you can start asking "what do I do about it?"

The observability market solved collection. It hasn't solved meaning. This is the distinction we built Infrawatch to address, and it's worth explaining why the gap exists and why it's harder to close than it appears.

The collection-completeness illusion

There's a common belief in observability: if you collect enough data, understanding follows. If you have traces, metrics, and logs (the three pillars of observability), you have what you need to diagnose anything. The data is there; you just have to look in the right place.

This belief is partially true. High-quality signals do make incidents diagnosable — eventually. But there's a difference between "diagnosable in principle, given time and the right person" and "immediately actionable for any engineer on the on-call rotation at 3am." The collection-completeness illusion mistakes the former for the latter.

Data completeness is necessary but not sufficient. What transforms complete data into immediate understanding is a model: a structured representation of what the data means in the context of your specific system, at this specific point in time, given what recently changed. That model doesn't exist in your metrics storage. It doesn't live in your trace backend. It's in the heads of your senior engineers, reconstructed from scratch during every incident.

How incidents actually get diagnosed today

Watch an experienced SRE investigate an incident. They don't browse dashboards randomly. They follow a mental dependency graph. They know service A calls service B calls service C. When A is erroring, they check B's error rate first. They know the data pipeline runs at 15-minute intervals and uses a shared database connection pool with the API service. They know the last deploy was 2 days ago and touched the authentication layer.

All of this context is implicit, maintained in the engineer's memory, and not encoded anywhere in their observability stack. When a less experienced on-call engineer gets paged, they have the same dashboards and none of the context. They'll eventually find the root cause — but they'll take 40 minutes instead of 8, and they'll wake up two other people to get there.

The question isn't "do we have enough data?" It's "how do we encode the context that senior engineers carry in their heads into the tooling that all engineers use?"

What correlation means in practice

Correlation, as we think about it, operates at three levels. Signal-level correlation is the simplest: grouping alerts that fire within the same time window and come from services that are related in the topology. This reduces the alert-to-incident ratio and prevents three engineers from being paged about the same root cause. Most observability platforms have some version of this.
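To make signal-level correlation concrete, here is a minimal sketch of grouping alerts by time window and topology. It is an illustration under stated assumptions, not Infrawatch's implementation: the `Alert` type, the `EDGES` set, the service names, and the 120-second window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    fired_at: float  # seconds since some reference time


# Hypothetical topology: undirected pairs of services that talk to each other.
EDGES = {("api", "auth"), ("api", "db"), ("billing", "ledger")}


def related(a: str, b: str) -> bool:
    """Two services are related if they are the same or share an edge."""
    return a == b or (a, b) in EDGES or (b, a) in EDGES


def group_alerts(alerts, window_s=120.0):
    """Greedy grouping: an alert joins the first group containing a member
    that fired within window_s of it and is topologically related."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        for group in groups:
            if any(
                abs(alert.fired_at - m.fired_at) <= window_s
                and related(alert.service, m.service)
                for m in group
            ):
                group.append(alert)
                break
        else:
            # No existing group matched, so this alert starts a new one.
            groups.append([alert])
    return groups


alerts = [
    Alert("api", 0.0),
    Alert("auth", 30.0),
    Alert("db", 45.0),
    Alert("billing", 50.0),  # close in time, but unrelated in topology
    Alert("api", 900.0),     # related, but far outside the window
]
groups = group_alerts(alerts)
print([[a.service for a in g] for g in groups])
```

In this toy input, five alerts collapse into three groups: the api, auth, and db alerts page once instead of three times, while the billing alert and the much later api alert stay separate.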

Change-aware correlation is harder: connecting signal anomalies to the changes that preceded them. When an error rate spike follows a Helm chart upgrade by 8 minutes, the relationship is very likely causal, and the upgrade is the first thing to check. This requires ingesting your change stream (CI/CD events, Terraform applies, Kubernetes ConfigMap updates, Helm release history) and treating it as a first-class signal source alongside your metrics and events.
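A rough sketch of the change-aware step, assuming change events have already been normalized into one stream. The `ChangeEvent` and `Anomaly` types, the 15-minute lookback, and the service names are hypothetical, chosen only to mirror the Helm example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    kind: str    # e.g. "helm_upgrade", "terraform_apply"
    target: str  # service the change touched
    at: float    # seconds since some reference time


@dataclass(frozen=True)
class Anomaly:
    service: str
    at: float


def candidate_changes(anomaly, changes, lookback_s=900.0):
    """Return changes to the anomalous service that landed within
    lookback_s before the anomaly, most recent first."""
    hits = [
        c for c in changes
        if c.target == anomaly.service and 0 <= anomaly.at - c.at <= lookback_s
    ]
    return sorted(hits, key=lambda c: anomaly.at - c.at)


changes = [
    ChangeEvent("terraform_apply", "checkout", at=0.0),    # an hour earlier
    ChangeEvent("helm_upgrade", "checkout", at=3120.0),    # 8 minutes earlier
    ChangeEvent("configmap_update", "search", at=3300.0),  # different service
]
spike = Anomaly("checkout", at=3600.0)
suspects = candidate_changes(spike, changes)
print([(c.kind, spike.at - c.at) for c in suspects])
```

Matching only on the same service is a deliberate simplification. A fuller version would also consider changes to services the anomalous one depends on, which is where the topology model comes in.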

Topology-aware correlation is the hardest: understanding that the error in service B is downstream of the anomaly in service A, not a separate problem. This requires a runtime topology model — not a static architecture diagram, but a live representation of how traffic flows through your services. Service mesh data, Kubernetes network policies, and tracing data all contribute to this. The correlation engine needs to know that service A is upstream of service B in the same namespace, so that an anomaly in A appearing 30 seconds before an anomaly in B is likely causal, not coincidental.
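The upstream check described here can be sketched as a graph reachability test combined with a lag window. This is a simplified illustration: the `DOWNSTREAM` map, the 60-second maximum lag, and the function names are assumptions, and a real topology model would be built from live mesh and trace data rather than a hard-coded dictionary.

```python
from collections import deque

# Hypothetical topology: each key is upstream of the services it lists.
DOWNSTREAM = {"A": ["B"], "B": ["C"], "D": []}


def is_upstream(src, dst, graph=DOWNSTREAM):
    """Breadth-first search: True if dst is reachable from src."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def likely_causal(first, second, max_lag_s=60.0):
    """first and second are (service, timestamp) pairs. The pair is flagged
    only if second follows first within max_lag_s and first is upstream."""
    (svc_a, t_a), (svc_b, t_b) = first, second
    lag = t_b - t_a
    return 0 < lag <= max_lag_s and is_upstream(svc_a, svc_b)


print(likely_causal(("A", 100.0), ("B", 130.0)))  # True: A is upstream, 30 s lag
print(likely_causal(("D", 100.0), ("B", 130.0)))  # False: no path from D to B
print(likely_causal(("B", 100.0), ("A", 130.0)))  # False: wrong direction
```

The result is a heuristic flag, not proof of causation: it says only that the timing and the topology are both consistent with A's anomaly propagating to B.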

Why we built the correlation engine before the collection layer

When we started building Infrawatch, the tempting path was to build another metrics pipeline: store your data with us, we'll give you dashboards. The problem with that path is that it competes on data volume and storage cost against vendors with years of infrastructure investment and economies of scale. It also doesn't solve the problem we care about.

We built the correlation engine as the core product, with integration to existing collection stacks (Prometheus, Datadog, CloudWatch, OpenTelemetry collectors) rather than replacing them. The assumption: your data is already in those systems. What's missing is the layer that connects the signals from different systems into a unified incident model.

This has a specific implication: Infrawatch works best when you already have good signal coverage. It's not the right tool if you don't have metrics on your services, if you have no Kubernetes event visibility, or if you're not tracking your config changes anywhere. We need inputs to correlate. We're a meaning layer on top of your collection tools, not a replacement for them.

The limits of automated correlation

Automated correlation gets you far, but it doesn't eliminate the need for human judgment. Correlation algorithms find temporal and topological coincidences. They can't tell you whether a coincidence is causal or spurious without additional context that sometimes only a human can provide.

False positive correlations are real. A network blip that causes a brief latency spike across multiple services simultaneously will look like a correlated incident even if each service recovered independently and no root cause investigation is warranted. An operator who understands the system knows this is a transient network event; a correlation engine that only sees the signals will flag it as an incident.

We're not claiming correlation eliminates human judgment. We're claiming it gives human judgment better inputs. The engineer looking at a correlated incident card still has to decide what the incident means and what to do about it. What they don't have to do is spend 20 minutes reconstructing context from scratch. That's the gap we're closing: from raw data collection to the point where a human expert's judgment can be applied productively.

Where this goes

The observability space is moving toward automated anomaly detection, automated root cause analysis, and automated remediation. We're skeptical of the fully automated end state for complex production systems. The failure modes of automated remediation are severe enough that most organizations will want humans in the loop for the foreseeable future.

But the intermediate step — reducing the time from "something is wrong" to "a human with context understands what's wrong" — is achievable today. Correlation is that intermediate step. It's not magic; it's engineering: structured data models, graph traversal, temporal windowing, change stream ingestion. The hard part is building it correctly, maintaining the topology model as your infrastructure evolves, and tuning correlation sensitivity for your system's specific false-positive / false-negative tradeoffs. That's the work we do so platform teams don't have to build it themselves.

