Config drift: the invisible outage cause

The postmortem says: "Root cause: misconfigured connection pool timeout." What the postmortem doesn't say — because nobody put it in the timeline — is that the connection pool timeout was changed three weeks ago in a Helm chart update that was bundled with a routine dependency version bump. It shipped on a Tuesday afternoon. Nobody noticed. The incident happened on a Friday night when traffic hit an unusual pattern that stressed the timeout value.

This is config drift. Not the classic definition — where production diverges from your desired state repo — but the broader phenomenon: your infrastructure's configuration changed at some point in the past, the change was not connected to the subsequent incident, and your postmortem spent two hours reconstructing a timeline that should have been available in minutes.

Why config changes are the hardest incident signal to correlate

Metrics are continuous. Kubernetes events are timestamped and structured. Config changes are neither. They happen discretely, through disparate systems (Helm, Terraform, ArgoCD, kubectl apply, AWS Console clicks, direct file edits), and they may not produce a structured event that flows into your observability pipeline at all.

A Prometheus alert firing at 23:47 UTC produces a clean, timestamped, labeled event. A Helm upgrade that deployed at 23:31 UTC may live only in your CI/CD system's audit log — not in any system that your incident investigation workflow touches. The 16-minute gap between that upgrade and the alert is exactly the kind of causal chain your on-call engineer needs, and exactly the kind of chain that doesn't exist in any single tool.

This fragmentation is the core problem. Your config changes are spread across: Helm release history in the cluster, Terraform state and run history, ArgoCD sync events, Kubernetes ConfigMap and Secret change history, feature flag service events, and the occasional undocumented manual change applied under pressure during a previous incident. Correlating a current incident with a relevant past config change requires querying all of these systems, which is a lot of friction at 2 a.m.

A real-world pattern: the silent Helm change

Here's a scenario that repeats with uncomfortable frequency across platform teams. A growing e-commerce platform runs its checkout service with a Redis cache layer. The checkout service Helm chart is updated as part of a routine dependency upgrade. The chart's default values include a maxmemory-policy setting for Redis: it was allkeys-lru in the old chart version and is noeviction in the new one. The person doing the upgrade doesn't review default value changes because the upgrade is "just a dependency version bump."

For two weeks, nothing happens. The cache is large enough that it never hits its memory limit. Then a marketing campaign launches, traffic triples, and the cache fills. Under the old allkeys-lru policy, Redis would have started evicting the least recently used keys. Under noeviction, Redis instead rejects write commands with "OOM command not allowed" errors. The checkout service, which doesn't handle Redis errors gracefully, starts throwing 500s. The incident fires.

In the postmortem, the team finds the incident root cause in about 45 minutes. But the timeline question ("when did this configuration change?") takes another hour and a half and requires someone who knows to look at the Helm release history and diff the default values between the two chart versions. That hour and a half would have been avoidable if the Helm chart change had been fingerprinted and associated with the incident window automatically.
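The "diff the default values" step is small enough to sketch. This assumes the two chart versions' default values have already been loaded into plain dicts (for example by parsing the output of helm show values with a YAML parser). The function name and the key names in the sample data are made up for illustration; real charts will name these settings differently.

```python
from typing import Any, Dict, Iterator, Tuple

_MISSING = object()      # sentinel: key is absent in this chart version
ABSENT = "<absent>"      # how an absent key is reported in the output


def diff_values(old: Dict[str, Any], new: Dict[str, Any],
                prefix: str = "") -> Iterator[Tuple[str, Any, Any]]:
    """Yield (dotted_path, old_value, new_value) for every value that differs.

    Assumes string keys, as in a typical values.yaml.
    """
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}.{key}" if prefix else key
        a = old.get(key, _MISSING)
        b = new.get(key, _MISSING)
        if isinstance(a, dict) and isinstance(b, dict):
            # Both sides are nested maps: descend instead of comparing wholesale.
            yield from diff_values(a, b, path)
        elif a != b:
            yield (path,
                   ABSENT if a is _MISSING else a,
                   ABSENT if b is _MISSING else b)


# Hypothetical defaults from the old and new chart versions.
old_defaults = {"redis": {"maxmemoryPolicy": "allkeys-lru", "maxmemory": "2gb"},
                "image": {"tag": "7.0"}}
new_defaults = {"redis": {"maxmemoryPolicy": "noeviction", "maxmemory": "2gb"},
                "image": {"tag": "7.2"}}

for path, before, after in diff_values(old_defaults, new_defaults):
    print(f"{path}: {before} -> {after}")
```

The expected image tag bump and the unexpected eviction policy change show up side by side, which is the whole point: the risky change is only visible if someone (or something) looks at defaults, not just at the values the team overrides.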

Config fingerprinting: what it means and what it requires

Config fingerprinting is the practice of capturing a structured snapshot of your configuration state at each change event and making that snapshot queryable in the context of incident investigation. A fingerprint includes: what changed (the diff), when it changed (timestamp with timezone), what system applied it (Helm / Terraform / kubectl / ArgoCD), which services it touches (namespace, deployment, service name), and an identifier linking back to the source artifact (chart version, Terraform module commit, PR number).
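As a rough illustration, the fields listed above map onto a small record type. This is a sketch, not a prescribed schema: the class name, field names, and the choice of a truncated SHA-256 as a stable ID are all assumptions, and the sample values are invented.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class ConfigFingerprint:
    diff: str                  # what changed
    changed_at: datetime       # when it changed (must carry a timezone)
    applied_by: str            # "helm" | "terraform" | "kubectl" | "argocd"
    namespace: str             # which services it touches...
    services: Tuple[str, ...]  # ...by deployment or service name
    source_ref: str            # chart version, module commit, or PR number

    def __post_init__(self) -> None:
        # A naive timestamp cannot be compared safely across systems.
        if self.changed_at.tzinfo is None:
            raise ValueError("changed_at must be timezone-aware")

    def fingerprint_id(self) -> str:
        """Stable ID derived from the content, with the time normalized to UTC."""
        payload = asdict(self)
        payload["changed_at"] = self.changed_at.astimezone(timezone.utc).isoformat()
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()[:16]


# Invented example values, for illustration only.
fp = ConfigFingerprint(
    diff="redis.maxmemoryPolicy: allkeys-lru -> noeviction",
    changed_at=datetime(2024, 1, 9, 14, 30, tzinfo=timezone.utc),
    applied_by="helm",
    namespace="checkout",
    services=("checkout",),
    source_ref="chart version bump (hypothetical)",
)
print(fp.fingerprint_id())
```

Rejecting naive timestamps at construction time is a deliberate choice here: a fingerprint without a timezone is exactly the kind of record that looks fine until you try to line it up against an alert from another system.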

To make fingerprinting work, you need change events flowing into a central system from all the places configuration lives. For Kubernetes-native changes, the Kubernetes API audit log is the authoritative source: every ConfigMap create/update/delete, every Deployment spec change, and every HPA scaling event goes through the API server and is recorded there, provided your audit policy is configured to log those requests. For Helm, the sources are the release history stored in Secrets in the cluster plus your Helm repository audit log. For Terraform, it is the run history, via the Terraform Cloud / Enterprise API or from your CI/CD system's job artifacts. For ArgoCD, it is the sync events from the ArgoCD application controller.
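The Helm release Secrets are worth a closer look, because their payload is not readable as-is. As far as we understand the Helm 3 storage format, the release field is gzipped JSON that Helm base64-encodes, and the Kubernetes API then base64-encodes it again like any other Secret data. The sketch below decodes that field under those assumptions; verify against your Helm version before relying on it. The fixture builder and the sample release content are invented for the example.

```python
import base64
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip header; older releases may be stored uncompressed


def decode_helm_release(data_release: str) -> dict:
    """Decode the `release` field of a Helm 3 release Secret as the API returns it."""
    payload = base64.b64decode(data_release)  # undo the Kubernetes Secret encoding
    payload = base64.b64decode(payload)       # undo Helm's own base64 layer
    if payload[:3] == GZIP_MAGIC:
        payload = gzip.decompress(payload)
    return json.loads(payload)


def encode_for_demo(release: dict) -> str:
    """Inverse of decode_helm_release, used here only to build a fixture."""
    raw = gzip.compress(json.dumps(release).encode("utf-8"))
    return base64.b64encode(base64.b64encode(raw)).decode("ascii")


# Invented release record; real records carry many more fields.
fixture = encode_for_demo({
    "name": "checkout",
    "version": 42,
    "chart": {"values": {"redis": {"maxmemoryPolicy": "noeviction"}}},
})
release = decode_helm_release(fixture)
print(release["name"], release["version"])
```

Once two consecutive revisions are decoded, their chart default values can be compared directly, which is the lookup that took the team in the Redis example an hour and a half by hand.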

The harder case is out-of-band changes: an engineer applies a patch directly with kubectl edit during a previous incident, or a manual console change is made by someone without cluster access. These changes may not produce a structured event anywhere. The closest you can get is reconciliation: periodically diffing the current cluster state against your GitOps desired state and flagging divergences. This is exactly the "config drift" that GitOps was designed to prevent — and even GitOps-enforced environments have escape hatches that get used under incident pressure.
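A minimal version of that reconciliation loop can be sketched as a hash comparison between desired and live objects. This assumes both sides have already been fetched into dicts keyed by a resource identifier of your choosing (the "namespace/Kind/name" keys here are a made-up convention). It strips a few server-populated metadata fields before hashing, but it does not account for defaulted fields the API server adds, so a naive comparison like this will over-report; tools such as kubectl diff handle that more carefully.

```python
import copy
import hashlib
import json
from typing import Dict, List, Tuple

# Fields the API server populates; they differ from Git without being drift.
SERVER_FIELDS = ("resourceVersion", "uid", "generation",
                 "creationTimestamp", "managedFields")


def normalize(manifest: dict) -> dict:
    """Return a copy of the manifest without status and server-populated metadata."""
    m = copy.deepcopy(manifest)
    m.pop("status", None)
    meta = m.get("metadata", {})
    for name in SERVER_FIELDS:
        meta.pop(name, None)
    return m


def spec_hash(manifest: dict) -> str:
    encoded = json.dumps(normalize(manifest), sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def find_drift(desired: Dict[str, dict], live: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Compare desired (Git) and live (cluster) objects by normalized hash."""
    findings = []
    for key in sorted(set(desired) | set(live)):
        if key not in live:
            findings.append((key, "missing_in_cluster"))
        elif key not in desired:
            findings.append((key, "not_in_git"))
        elif spec_hash(desired[key]) != spec_hash(live[key]):
            findings.append((key, "modified"))
    return findings
```

Each finding is itself a change event with an unknown author and an approximate timestamp (somewhere between the last clean reconciliation and this one), which is less than you get from an audit log but far more than nothing.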

The time-window correlation problem

Not all config changes cause immediate incidents. Some changes have delayed effects: a timeout value change that only matters under high load, a memory limit change that only causes OOMKills during batch job windows, a circuit breaker threshold change that only triggers during downstream degradation. The correlation window between a config change and a related incident might be hours or days, not minutes.

This makes automated correlation harder. A 5-minute correlation window catches obvious cases (deploy + immediate error spike). A 24-hour window produces too many false positives — everything correlates with everything if the window is wide enough. The right approach is layered: tight automatic correlation for high-confidence causal relationships (change + immediate signal anomaly in the same namespace), and human-assisted correlation for longer-window relationships (surfacing "here are config changes in the last 7 days that touched this service" during active investigation).
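The layered approach can be sketched as a single pass that sorts changes into two buckets. The 5-minute and 7-day windows come from the discussion above; everything else (the Change type, the function name, the assumption that the caller only invokes this once a signal anomaly has already fired) is illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Change:
    service: str
    namespace: str
    at: datetime      # timezone-aware
    summary: str


def correlate(changes: Iterable[Change], incident_start: datetime,
              namespace: str, service: str,
              tight: timedelta = timedelta(minutes=5),
              wide: timedelta = timedelta(days=7)) -> Tuple[List[Change], List[Change]]:
    """Split changes into (auto_correlated, for_human_review)."""
    auto: List[Change] = []
    review: List[Change] = []
    for change in changes:
        age = incident_start - change.at
        if age < timedelta(0) or age > wide:
            continue  # after the incident began, or too old to surface
        if age <= tight and change.namespace == namespace:
            auto.append(change)      # high confidence: same namespace, minutes before
        elif change.service == service:
            review.append(change)    # lower confidence: show it, let a human judge
    review.sort(key=lambda c: c.at, reverse=True)  # most recent first
    return auto, review
```

The two buckets are treated differently downstream: the first can be attached to the alert automatically, while the second is only a shortlist to put in front of whoever is investigating.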

We're not saying automated correlation replaces postmortem investigation for complex causal chains. We're saying it should eliminate the "how do I even find all the changes that touched this service in the past 48 hours" step, which is currently a manual archaeology exercise in most teams.

Making config change correlation operational

The practical starting point is: get all your change events into one place with consistent timestamps and service attribution. This doesn't require replacing your existing config management systems. It requires building or adopting a pipeline that reads from Kubernetes audit logs, Helm release history, your CI/CD system's deployment events, and ArgoCD sync logs — and normalizes these into a unified change event stream with a consistent schema.
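To make "consistent schema" concrete, here is a sketch of normalizing two differently shaped inputs into one event type. The Kubernetes audit fields used (verb, objectRef, requestReceivedTimestamp, auditID) are part of the audit Event format; the CI deploy event shape is entirely hypothetical, as is the ChangeEvent schema itself. Using the object name as the service is a simplification; a real pipeline would more likely resolve it from labels or owner references.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeEvent:
    source: str       # which system reported the change
    at: datetime      # always UTC, always timezone-aware
    namespace: str
    service: str
    action: str
    ref: str          # identifier pointing back to the source record


def parse_rfc3339(ts: str) -> datetime:
    """Parse a timestamp such as 2024-01-09T14:30:00.000000Z into UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)


def from_k8s_audit(event: dict) -> ChangeEvent:
    obj = event["objectRef"]
    return ChangeEvent(
        source="kubernetes-audit",
        at=parse_rfc3339(event["requestReceivedTimestamp"]),
        namespace=obj.get("namespace", ""),
        service=obj["name"],  # simplification: object name stands in for service
        action=f'{event["verb"]} {obj["resource"]}',
        ref=event["auditID"],
    )


def from_ci_deploy(event: dict) -> ChangeEvent:
    # Hypothetical CI payload: epoch seconds plus free-form fields.
    return ChangeEvent(
        source="ci",
        at=datetime.fromtimestamp(event["finished_epoch"], tz=timezone.utc),
        namespace=event["namespace"],
        service=event["service"],
        action="deploy",
        ref=str(event["job_id"]),
    )
```

Converting every timestamp to timezone-aware UTC at the edge is what makes the later correlation trustworthy; mixing local times and epoch values downstream is a reliable way to produce a timeline that is off by exactly the number of hours that hides the cause.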

Once you have that stream, correlation becomes a query: given an incident start time and affected services, what config changes touched those services in the preceding N hours? That query should return results in seconds, not require a 90-minute postmortem archaeology session. Platform teams that build this capability tend to report that it changes the character of incident response, not just by reducing MTTR, but by shifting the investigation from "what could have caused this" to "which of these specific changes is the most likely cause." That's a different cognitive task, and a more productive one.
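Here is roughly what that query looks like, using an in-memory SQLite table as a stand-in for whatever store holds the change stream. The table layout, column names, and helper functions are assumptions for the sketch. Timestamps are stored as fixed-format UTC strings so that string comparison matches chronological order.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import List, Sequence, Tuple


def iso_utc(dt: datetime) -> str:
    """Fixed-width UTC string; sorts lexicographically in time order."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE change_events ("
               "at TEXT NOT NULL, service TEXT NOT NULL, "
               "source TEXT NOT NULL, summary TEXT NOT NULL)")
    db.execute("CREATE INDEX idx_service_at ON change_events (service, at)")
    return db


def record_change(db: sqlite3.Connection, at: datetime, service: str,
                  source: str, summary: str) -> None:
    db.execute("INSERT INTO change_events VALUES (?, ?, ?, ?)",
               (iso_utc(at), service, source, summary))


def changes_before(db: sqlite3.Connection, incident_start: datetime,
                   services: Sequence[str], hours: int) -> List[Tuple[str, str, str, str]]:
    """Changes that touched the given services in the N hours before the incident."""
    if not services:
        return []
    marks = ",".join("?" * len(services))  # one placeholder per service
    rows = db.execute(
        f"SELECT at, service, source, summary FROM change_events "
        f"WHERE service IN ({marks}) AND at BETWEEN ? AND ? ORDER BY at DESC",
        [*services, iso_utc(incident_start - timedelta(hours=hours)),
         iso_utc(incident_start)],
    )
    return rows.fetchall()
```

With an index on service and timestamp, this is a range scan rather than a search, which is what turns the answer into something you get while the incident is still open rather than in the postmortem.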

