OOMKill: the silent incident multiplier

A pod gets OOMKilled. Kubernetes restarts it in a few seconds. Your liveness probe passes. Your readiness probe passes. Prometheus shows the container back up. On the surface, nothing happened. In reality, you just dropped in-flight requests, potentially corrupted a write-through cache, interrupted a background job mid-execution, and lost whatever in-memory state that service was maintaining. OOMKill is not a benign restart. It's a hard kill: the kernel sends SIGKILL, the equivalent of kill -9, and the process gets no chance to clean up.

The problem isn't that one pod OOMKilled. The problem is that OOMKills almost never happen in isolation, and the incident they're part of is usually already underway when the first kill fires.

How OOMKill actually works in Kubernetes

When a container's memory usage exceeds its limit, the Linux kernel's OOM killer terminates a process inside that container's cgroup, and the container is reported as OOMKilled (typically with exit code 137). There's no graceful shutdown. No drain period. No SIGTERM followed by a grace period. The process is killed immediately by the kernel, and Kubernetes restarts the container according to the pod's restart policy. This is fundamentally different from a kubectl rollout restart or a graceful pod eviction, both of which send SIGTERM first and give the container its termination grace period to finish in-flight requests.

The key metric to watch is not just OOMKill events but container_memory_working_set_bytes relative to the container's memory limit. A container whose working set sits at 85% of its limit is not "fine". It's one traffic spike, one deferred garbage collection, or one batch job invocation away from crossing the threshold. By the time the OOMKill shows up in the container status or your event stream, the container has already been under memory pressure for some time.

Kubernetes also distinguishes between memory limit violations (which cause OOMKill) and node-level memory pressure (which causes eviction). Evictions happen when the node itself is under memory pressure and the kubelet needs to reclaim memory. The kubelet picks eviction candidates by looking at pod priority and at usage relative to requests, which is where QoS class comes in. Container-limit OOMKills bypass all of that: they're enforced by the kernel through the container's cgroup memory limit, not decided by the kubelet. This distinction matters when you're diagnosing whether you have a container sizing problem, a node sizing problem, or both.
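To make the distinction concrete, here is a minimal sketch that tells the two cases apart from a pod object. It assumes the pod is a plain dict shaped like the JSON from kubectl get pod -o json; the function name classify_memory_termination is hypothetical, and the field paths (status.reason set to "Evicted", and a container's terminated reason set to "OOMKilled") are the ones Kubernetes normally populates.

```python
from typing import Optional


def classify_memory_termination(pod: dict) -> Optional[str]:
    """Return "eviction", "oomkill", or None for a pod given as a plain dict.

    Assumption: the dict mirrors the JSON from `kubectl get pod -o json`.
    """
    status = pod.get("status", {})

    # Evicted pods are marked at the pod level by the kubelet.
    if status.get("reason") == "Evicted":
        return "eviction"

    # OOMKills are recorded per container, in the current or previous state.
    for container_status in status.get("containerStatuses", []):
        for state_key in ("lastState", "state"):
            terminated = container_status.get(state_key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "oomkill"

    return None
```

An "eviction" result points toward node sizing or requests that are set too low; an "oomkill" result points toward the container's own limit or its memory behavior.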

The OOMKill as a correlated signal, not an isolated event

Consider a platform engineering team managing a 160-service mesh at a SaaS company. They have an order processing service that handles peak traffic on weekday mornings. The service's memory usage is typically 1.2GB against a 2GB limit, which looks like comfortable headroom. During one week's peak window, a new feature flag rolls out that enables an additional enrichment step in the order processing pipeline. The enrichment loads a reference dataset into memory for each worker thread. Four threads at roughly 300MB each is 1.2GB of additional allocation on top of the 1.2GB baseline, pushing the total working set to about 2.4GB and breaching the 2GB limit.
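The arithmetic in this scenario is the kind of check a memory budget review would run before a rollout. A small sketch, using only the numbers from the example above (the function name projected_working_set_gb is made up for illustration, and decimal units are assumed to match the round figures):

```python
MB_PER_GB = 1000  # decimal units, matching the round numbers in the example


def projected_working_set_gb(baseline_gb: float, threads: int, dataset_mb_per_thread: float) -> float:
    """Baseline working set plus one dataset copy per worker thread."""
    extra_gb = threads * dataset_mb_per_thread / MB_PER_GB
    return baseline_gb + extra_gb


limit_gb = 2.0
projected = projected_working_set_gb(baseline_gb=1.2, threads=4, dataset_mb_per_thread=300)
over_limit = projected > limit_gb

print(f"projected {projected:.1f}GB vs limit {limit_gb:.1f}GB, over limit: {over_limit}")
```

The per-thread multiplier is what makes the step easy to miss: a 300MB dataset sounds harmless until it is loaded four times.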

Three pods OOMKill within a 40-second window. Each restart takes about 15 seconds. During those 15 seconds, in-flight order processing requests fail. The upstream API gateway starts seeing a spike in 503 errors. The downstream notification service starts seeing empty order IDs in its queue (requests that were mid-processing when the kill happened wrote partial state to the queue). The SLO burn rate alert fires. The notification service dead-letter queue starts filling. Now you have four separate active alerts: OOMKill events, gateway 503 spike, notification service error rate, and dead-letter queue depth.

If your alerting system sees these as four independent incidents, you have four investigation threads and probably three different on-call engineers paged. If your alerting system can correlate them — same namespace, same 40-second window, OOMKill event as the probable cause — you have one incident with a clear starting point.
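One way to picture that correlation is a simple grouping pass over incoming alerts. The sketch below is not how any particular alerting product works; the Alert type, the alert names, and the 120-second window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    name: str
    namespace: str
    fired_at: float  # seconds since epoch


def group_alerts(alerts: List[Alert], window_s: float = 120.0) -> List[List[Alert]]:
    """Group alerts that share a namespace and fire within window_s of the group's first alert."""
    groups: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        for group in groups:
            first = group[0]
            if alert.namespace == first.namespace and alert.fired_at - first.fired_at <= window_s:
                group.append(alert)
                break
        else:
            # No existing group matched, so this alert starts a new incident.
            groups.append([alert])
    return groups


def probable_cause(group: List[Alert]) -> Alert:
    """Prefer an OOMKill alert as the starting point; otherwise use the earliest alert."""
    for alert in group:
        if alert.name == "OOMKilled":
            return alert
    return group[0]
```

Fed the four alerts from the scenario, this yields one group with the OOMKill as its suggested starting point, instead of four separate pages.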

Memory limits: the right-sizing paradox

The instinctive response to OOMKills is to raise the memory limit. Sometimes that's right. Often it's the wrong fix for the actual problem — and it papers over a signal that would have told you something important about your application's memory behavior.

A service that's OOMKilling due to a memory leak needs its leak fixed, not more headroom. Raising the limit delays the kill but doesn't eliminate it. You now OOMKill every 48 hours instead of every 12, which is arguably worse from a detection standpoint because the issue becomes harder to observe and reproduce. A service that OOMKills due to unbounded in-memory caching needs a cache eviction policy, not a bigger limit. A service that OOMKills during batch jobs needs those jobs separated from the serving process, not a larger container.

We're not saying you should never raise memory limits. For a service that has genuinely grown its working set due to increased functionality — like the enrichment example above — a limit increase combined with a memory budget review is the right call. The distinction is between a service that has legitimately outgrown its allocation versus a service that has a pathological memory behavior masked by a generous limit.

Detecting pre-OOMKill memory pressure

The most useful OOMKill-related metric is one that fires before the kill, not after. container_memory_working_set_bytes / container_spec_memory_limit_bytes gives you the memory utilization ratio per container. Containers without a memory limit report a limit of 0, so filter those out before dividing. Alert at 80% sustained for 5 minutes, and again at 90% sustained for 2 minutes. This gives your on-call engineer a warning while the container is still running, when it can still be investigated or restarted gracefully.
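In practice this would usually be an alerting rule in your monitoring system. The Python sketch below only illustrates the logic of a sustained-threshold check; the thresholds and windows come from the paragraph above, while the function name, the sample format, and the severity labels are assumptions.

```python
from typing import List, Optional, Tuple

# (threshold ratio, sustained seconds, severity); most severe rule first.
RULES = [(0.90, 120, "critical"), (0.80, 300, "warning")]


def memory_pressure_severity(
    samples: List[Tuple[float, float]], limit_bytes: float
) -> Optional[str]:
    """samples: (timestamp_s, working_set_bytes) pairs, oldest first."""
    if limit_bytes <= 0 or not samples:
        return None  # no limit set (reported as 0), so the ratio is undefined

    now = samples[-1][0]
    for threshold, window_s, severity in RULES:
        if now - samples[0][0] < window_s:
            continue  # not enough history to cover the whole window
        recent = [ws for ts, ws in samples if ts >= now - window_s]
        # "Sustained" means every sample in the window is at or above the threshold.
        if all(ws / limit_bytes >= threshold for ws in recent):
            return severity
    return None
```

Requiring every sample in the window to exceed the threshold keeps a single spike from paging anyone, which is the point of the sustained window.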

Pair this with kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1 (in kube-state-metrics the reason is a label on the series, not its value) to track which containers were last terminated by an OOMKill. A container that OOMKills repeatedly but never shows up on memory pressure alerts is probably hitting the limit in a very short burst, which points to a specific allocation event (a batch job, a large request payload, a particular API call pattern) rather than a sustained memory growth problem. These are different root causes requiring different fixes.
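The burst-versus-growth distinction can be roughed out from the utilization samples leading up to a kill. This is a heuristic sketch only: the function name, the 0.80 pressure threshold, and the minimum of three pressured samples are all assumptions you would tune to your own scrape interval.

```python
from typing import List


def classify_oom_pattern(
    ratios_before_kill: List[float],
    pressure_threshold: float = 0.80,
    min_pressured_samples: int = 3,
) -> str:
    """ratios_before_kill: working_set / limit samples before the kill, oldest first."""
    pressured = sum(1 for ratio in ratios_before_kill if ratio >= pressure_threshold)
    if pressured >= min_pressured_samples:
        # Utilization stayed high for a while: consistent with a leak or steady growth.
        return "sustained_growth"
    # Utilization looked healthy until just before the kill: look for one large allocation.
    return "burst_allocation"
```

A "burst_allocation" result suggests looking at what ran at the moment of the kill; a "sustained_growth" result suggests looking at the trend over hours or days.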

OOMKill in the context of config changes

The most operationally useful OOMKill correlation is with your config change stream. A service that OOMKills within 30 minutes of a Helm upgrade, a ConfigMap change, or a feature flag rollout is almost certainly doing so because of that change. Without the change context attached to the OOMKill event, your on-call engineer starts the investigation cold — they'll check the service logs, look at recent traffic patterns, examine the memory profile. With the change context, the first hypothesis is obvious: what did this change allocate?

This is why OOMKill events should flow through the same correlation pipeline as your metrics alerts and config change events. The OOMKill alone is ambiguous. The OOMKill plus the Helm chart version bump 25 minutes prior is a starting point. The OOMKill plus the Helm chart bump plus the memory working set trend showing a step change at deploy time is a near-complete incident narrative before the engineer has written a single kubectl command.
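A minimal sketch of attaching change context to an OOMKill might look like the following. The ChangeEvent type, its field names, and the function name are hypothetical; the 30-minute lookback matches the window mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChangeEvent:
    kind: str          # e.g. "helm_upgrade", "configmap", "feature_flag"
    service: str
    applied_at: float  # seconds since epoch


def most_recent_change(
    service: str,
    killed_at: float,
    changes: List[ChangeEvent],
    lookback_s: float = 1800.0,
) -> Optional[ChangeEvent]:
    """Return the latest change to this service within lookback_s before the kill, if any."""
    candidates = [
        change
        for change in changes
        if change.service == service and 0 <= killed_at - change.applied_at <= lookback_s
    ]
    return max(candidates, key=lambda change: change.applied_at, default=None)
```

When this returns a change, the first question for the on-call engineer is already framed: what did that change allocate?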

OOMKills are not benign restarts that your cluster handles gracefully. They're hard signals that something in your memory model is wrong — either the application's behavior, the resource allocation, or the interaction between a change and the service's runtime. Treat them as first-class incident signals, correlate them with what changed, and you'll find root causes before postmortem season.
