OpenTelemetry: the foundation you still need to build on

OpenTelemetry is one of the most important infrastructure decisions the industry has made in the past decade. The consolidation of tracing, metrics, and logs into a single, vendor-neutral instrumentation standard has genuinely improved the ecosystem. You can instrument once and route signals to multiple backends. You can change backends without re-instrumenting. The OTLP wire protocol has become the common language of observability data in motion.

OTel standardized the signal. It didn't standardize the insight. The gap between "we have OTel instrumentation across our services" and "we have meaningful incident understanding" is where real engineering work happens — and it's a gap that OTel's architecture intentionally doesn't close.

What OTel actually gives you

OpenTelemetry provides: a set of language-specific SDKs for instrumenting applications, a Collector component for receiving, processing, and exporting telemetry data, the OTLP protocol for transporting that data, and semantic conventions for naming attributes consistently across services and languages. It's an instrumentation and transport standard, not an analysis platform.

When you deploy OTel correctly, you get: traces that capture request flows across service boundaries (assuming your services propagate W3C trace context headers), metrics exported in OTLP format that any compatible backend can receive, and structured logs that can carry trace context to correlate with spans. This is genuinely powerful baseline coverage.

What OTel doesn't provide: a storage backend for your signals, a query layer for investigation, alert rules, an incident model, or any mechanism for correlating its signals with events that originate outside OTel. Trace context lets spans and logs carry matching IDs, but nothing in OTel joins them at query time. OTel ends at the collector output. Everything downstream of the collector is your problem to solve, or someone else's product to buy.

The collector-to-backend gap

The OTel Collector-to-backend pipeline is well understood. Your traces go to Jaeger or Tempo. Your metrics go to Prometheus or a compatible backend. Your logs go to Loki or an aggregator. Each of these backends has excellent tooling for querying its specific data type: distributed trace visualization in Jaeger, metric exploration in Prometheus, log search in Loki.

The gap appears when an incident crosses signal types. A p99 latency spike (metrics) is caused by a pod OOMKill (Kubernetes event — not an OTel signal) which was caused by a Helm chart change (config event — also not an OTel signal) that happened 20 minutes before the spike. OTel gives you high-quality trace data showing which services experienced the latency spike. It gives you metric data showing the p99 trend. It doesn't give you the Kubernetes event. It doesn't give you the Helm change. And even the trace data and metric data live in different backends with different query interfaces.

Correlating an OTel trace showing "request X took 4.2 seconds" with a Prometheus metric showing "p99 was 4.1 seconds in that window" with a Kubernetes event showing "pod restarted at T-60s" with a Helm audit log showing "chart upgraded at T-20min" requires navigating four separate systems, four separate query languages, and manually aligning timestamps. This is the correlation gap that OTel doesn't address, because it's not what OTel is designed to address.
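To make the manual alignment concrete, here is a minimal sketch of what a correlation layer does at its simplest: pull events from each system into one list and order them on a shared clock. The event records, field names, and the timeline helper are all hypothetical; a real implementation would query each backend through its own API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical events pulled from four separate systems. The field names and
# values are illustrative, not the output format of any real backend.
T = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # time of the slow request

events = [
    {"source": "trace", "at": T, "what": "request X took 4.2 seconds"},
    {"source": "metrics", "at": T, "what": "p99 was 4.1 seconds"},
    {"source": "kubernetes", "at": T - timedelta(seconds=60), "what": "pod restarted"},
    {"source": "helm", "at": T - timedelta(minutes=20), "what": "chart upgraded"},
]


def timeline(events, anchor, lookback):
    """Return the events that fall within `lookback` before `anchor`, oldest first."""
    window_start = anchor - lookback
    in_window = [e for e in events if window_start <= e["at"] <= anchor]
    return sorted(in_window, key=lambda e: e["at"])


for e in timeline(events, anchor=T, lookback=timedelta(minutes=30)):
    offset = int((e["at"] - T).total_seconds())
    print(f"T{offset:+d}s  {e['source']:<10}  {e['what']}")
```

The sort is trivial. The hard part in practice is getting all four sources into one place with comparable timestamps, which is exactly the work OTel leaves to the layer above it.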

Cardinality: the OTel scaling challenge

One of the more painful realities of OTel in production is cardinality management. OTel semantic conventions encourage (correctly) labeling metrics with service name, service version, environment, and other attributes. As teams gain confidence in OTel, they naturally want to add more attributes: customer ID, tenant ID, user tier, feature flag state. Each additional attribute multiplies the number of unique time series.

A metric with 5 attributes where each has modest cardinality (say, 10 service names × 5 versions × 3 environments × 1000 customer IDs × 2 user tiers) produces up to 300,000 unique time series from a single metric. Push that through an OTel Collector to Prometheus, and you have a cardinality explosion that can cause your Prometheus instance to OOM or become too slow to query reliably.
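The arithmetic is worth running before you add an attribute, not after. This sketch multiplies the per-attribute counts from the example above. The attribute keys customer.id and user.tier are illustrative names, and the result is a worst case that assumes every combination of values actually occurs.

```python
from math import prod

# Distinct values per attribute, taken from the example above.
# "customer.id" and "user.tier" are illustrative attribute names.
cardinalities = {
    "service.name": 10,
    "service.version": 5,
    "deployment.environment": 3,
    "customer.id": 1000,
    "user.tier": 2,
}


def worst_case_series(cardinalities):
    """Upper bound on time series: the product of per-attribute cardinalities."""
    return prod(cardinalities.values())


total = worst_case_series(cardinalities)
without_customer = worst_case_series(
    {k: v for k, v in cardinalities.items() if k != "customer.id"}
)
print(total, without_customer)
```

With these numbers the totals are 300,000 series with the customer attribute and 300 without it, which shows how much a single high-cardinality attribute dominates the bill.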

The OTel Collector's processor pipeline has cardinality limiting capabilities — attribute filtering, metric aggregation transformations, sampling for traces. But configuring these correctly requires understanding your cardinality profile, which requires querying your metrics backend, which may already be struggling under cardinality pressure. Teams often discover this problem after deployment, not before.
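As a rough illustration of what attribute filtering plus aggregation does to a metric stream, the sketch below drops one attribute and sums the data points that collapse into the same series. It is a plain-Python model of the idea, not Collector code; the drop_attributes helper and the sample points are hypothetical, and summing only makes sense for counter-style metrics, not gauges.

```python
from collections import defaultdict


def drop_attributes(points, drop):
    """Remove the given attribute keys, then sum values that now share a series."""
    merged = defaultdict(float)
    for attrs, value in points:
        # The remaining attributes, in a stable order, identify the series.
        kept = tuple(sorted((k, v) for k, v in attrs.items() if k not in drop))
        merged[kept] += value
    return dict(merged)


# Hypothetical counter data points: (attributes, value).
points = [
    ({"service.name": "checkout", "customer.id": "c1"}, 3.0),
    ({"service.name": "checkout", "customer.id": "c2"}, 5.0),
    ({"service.name": "search", "customer.id": "c1"}, 2.0),
]

reduced = drop_attributes(points, drop={"customer.id"})
for series, value in reduced.items():
    print(dict(series), value)
```

Three input series become two. The per-customer breakdown is gone, which is the trade-off you are choosing when you filter an attribute in the pipeline.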

We're not saying high-cardinality metrics are wrong. They're often the right data model for per-customer performance analysis. We're saying the OTel pipeline doesn't automatically handle cardinality for you. This is infrastructure work that has to happen between instrumentation and useful signal delivery.

Trace context propagation: the failure case that hurts

Distributed tracing only works when trace context propagates correctly through your entire call path. In a polyglot microservices environment, this is harder than it sounds. Every service in the call path must: use an OTel-compatible SDK or instrumentation library, be configured to propagate W3C trace context headers, not strip or overwrite those headers in any middleware layer, and correctly handle async paths (message queues, background jobs) that break the synchronous call model.

A common failure pattern: a Java service instrumented with OTel propagates trace context correctly. It calls a Python service instrumented with OTel — context propagates. The Python service puts a job on a Redis queue. The Go worker that processes that job was instrumented with OTel, but nobody configured the queue consumer to extract and propagate the trace context from the job payload. The trace breaks at the queue boundary. You get two separate traces — one ending at the producer, one starting fresh at the consumer — and no way to reconstruct the end-to-end latency or identify that the consumer was the slow path.
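The fix at the queue boundary is mechanical: the producer writes the trace context into the job, and the consumer reads it back before starting its span. The sketch below hand-rolls a simplified W3C traceparent value using only the standard library so the mechanics are visible. It uses Python on both sides for brevity, and the inject and extract helpers and the job payload shape are assumptions for illustration. Real services would use the propagators shipped with their OTel SDK rather than code like this.

```python
import json
import re
import secrets

# Simplified traceparent: version 00, 32-hex trace id, 16-hex span id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(trace_id, span_id, carrier):
    """Producer side: write the current context into the job's headers."""
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-01"


def extract(carrier):
    """Consumer side: return (trace_id, parent_span_id), or None if absent or invalid."""
    match = TRACEPARENT_RE.match(carrier.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, _flags = match.groups()
    return trace_id, parent_span_id


# Producer: enqueue a job that carries the trace context in its payload.
trace_id = secrets.token_hex(16)
producer_span_id = secrets.token_hex(8)
job = {"task": "resize_image", "headers": {}}  # hypothetical job shape
inject(trace_id, producer_span_id, job["headers"])
payload = json.dumps(job)  # what would be pushed onto the queue

# Consumer: continue the same trace instead of starting a new one.
received = json.loads(payload)
context = extract(received["headers"])
if context is None:
    consumer_trace_id, parent_span_id = secrets.token_hex(16), None  # broken trace
else:
    consumer_trace_id, parent_span_id = context

print(consumer_trace_id == trace_id, parent_span_id == producer_span_id)
```

The failure described above is the case where the consumer never calls the extract step, so it always takes the branch that starts a fresh trace.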

This is a real-world implementation gap that requires active investment to close. It's not a flaw in OTel's design: the Propagators API can inject and extract context through any carrier, including message headers or a job payload, and the messaging semantic conventions describe how to relate producer and consumer spans. It's a deployment reality: in a large microservices fleet, every service team needs to understand and correctly implement propagation, and there will be gaps. Auditing trace continuity across service boundaries is a maintenance task, not a one-time setup.

Building on OTel: what the layer above needs to do

Given OTel's scope — transport and instrumentation, not analysis — the layer you build on top of it needs to provide what OTel intentionally omits. At minimum: unified storage or federation that allows querying across trace, metric, and log data in a single interface. Alert rules that operate on OTel-exported metrics. And critically, the ability to correlate OTel signals with non-OTel signals: Kubernetes events, config change events, infrastructure metrics that come from node exporters rather than OTel-instrumented applications.

The OTel Collector's export architecture makes this easier than it was before the standard existed. You can export your OTLP data to a backend that enriches it with context from other sources — Kubernetes metadata, deployment labels, node annotations — because the collector can read from the Kubernetes API and attach metadata to spans and metrics before export. This enrichment pattern turns OTel data from "signals with service-level context" to "signals with infrastructure-level context," which is the difference between seeing that a service had a latency spike and seeing that the service had a latency spike because it's running on a node that was under memory pressure.
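Conceptually, the enrichment step is a lookup and a merge. The sketch below models it in plain Python, with a hard-coded dictionary standing in for what would come from the Kubernetes API. The enrich helper, the pod names, and the metadata values are hypothetical. In a Collector deployment this job is typically done by a processor such as k8sattributes from the contrib distribution, not by application code.

```python
# Hypothetical pod metadata, standing in for data read from the Kubernetes API.
POD_METADATA = {
    "checkout-7d9f": {
        "k8s.node.name": "node-a",
        "k8s.deployment.name": "checkout",
    },
}


def enrich(span_attributes, pod_metadata):
    """Attach infrastructure metadata to a span, keyed by its pod name."""
    pod = span_attributes.get("k8s.pod.name")
    extra = pod_metadata.get(pod, {})
    # Attributes already on the span take precedence over looked-up metadata.
    return {**extra, **span_attributes}


span = {"service.name": "checkout", "k8s.pod.name": "checkout-7d9f"}
enriched = enrich(span, POD_METADATA)
print(enriched)
```

Once the node name is on the span, a question like "which slow requests ran on the node under memory pressure" becomes a filter instead of a cross-system investigation.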

OTel is the right foundation. It's a foundation, not a finished building. The work of turning good telemetry collection into effective incident response happens in the layer above the collector: in the correlation engine, in the alert model, in the incident context delivery system. That's where organizations that have adopted OTel well separate themselves from organizations that have OTel deployed but still fight 45-minute incident response times. The instrumentation was the easy part.
