Platform engineering teams need different tools

SRE and platform engineering are often used interchangeably. They shouldn't be. An SRE's primary job is keeping services available — their mental model is service-centric, their tooling centers on alert thresholds, error rates, and SLO burn. A platform engineer's primary job is keeping the platform itself healthy so that all the services running on it can operate correctly — their mental model is infrastructure-centric, and the questions they're asking are fundamentally different.

The tools built for SRE work are very good at what they do. But when platform engineers reach for those same tools, they're squinting at dashboards designed to answer the wrong question. The observability market was built for the SRE. Platform engineering teams are making do.

The question difference

An SRE investigating an incident asks: "Is my service healthy? What's the error rate? What's the p99 latency? Which endpoint is degraded?" These are service-scoped questions with known answers if you have the right metrics.

A platform engineer investigating an incident asks: "Is the platform causing service degradation? Is this a node problem, a networking problem, a control plane problem, or a service problem? How many services are affected? Is this correlated with something I changed in the platform layer?" These are cross-service, infrastructure-scoped questions that require a different data model to answer.

Take a specific example: a Kubernetes cluster in us-west-2 starts experiencing intermittent pod scheduling failures. Individual service SREs each see their service acting up, with some requests failing and some pods stuck in Pending state. Each of them is looking at their service dashboard and seeing symptoms. None of them can see from their service dashboard that the issue is actually at the cluster level: new pods are taking 8 seconds to be admitted and scheduled because a buggy admission webhook is timing out on every pod creation request that reaches the API server. The platform engineer sees this, but only if they have visibility into control plane latency, webhook audit logs, and cross-service pod scheduling timelines simultaneously.

What SRE tools don't surface for platform teams

Standard SRE observability tooling has a fundamental design assumption: the service is the unit of concern. Dashboards are organized per service. Alerts are scoped per service. SLOs are defined per service. When you're an SRE responsible for service reliability, this is the right model.

Platform engineers need to think about the infrastructure substrate that all services share. The questions that matter are: Which nodes are under memory pressure right now? Which namespaces are hitting resource quotas? How has the cluster's overall pod scheduling latency trended over the past 6 hours? Which services experienced pod restarts in the last 24 hours and were those restarts correlated with a node event? What's the correlation between my etcd write latency and my API server response times?

These questions require cross-service aggregation. They require infrastructure-level metrics that don't belong to any single service. They require correlating Kubernetes control plane health with workload behavior. Standard per-service dashboards don't answer them — you'd need to build custom dashboards that aggregate across services, and even then you're missing the correlation layer that connects infrastructure events to workload symptoms.
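To make the "were those restarts correlated with a node event?" question concrete, here is a minimal sketch of the join it implies. The record types, field names, event labels, and the 10-minute window are all assumptions made for illustration, not the schema of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class NodeEvent:
    node: str
    kind: str      # illustrative label, e.g. "MemoryPressure"
    at: datetime


@dataclass(frozen=True)
class PodRestart:
    service: str
    node: str
    at: datetime


def correlate_restarts(restarts, node_events, window=timedelta(minutes=10)):
    """Pair each restart with node events on the same node that
    occurred at most `window` before the restart."""
    events_by_node = {}
    for ev in node_events:
        events_by_node.setdefault(ev.node, []).append(ev)

    correlated = []
    for r in restarts:
        for ev in events_by_node.get(r.node, []):
            if timedelta(0) <= r.at - ev.at <= window:
                correlated.append((r, ev))
    return correlated


# Hypothetical sample data.
t0 = datetime(2024, 1, 1, 12, 0)
node_events = [NodeEvent("node-a", "MemoryPressure", t0)]
restarts = [
    PodRestart("checkout", "node-a", t0 + timedelta(minutes=3)),
    PodRestart("search", "node-b", t0 + timedelta(minutes=4)),   # different node
    PodRestart("cart", "node-a", t0 + timedelta(hours=2)),       # outside the window
]

pairs = correlate_restarts(restarts, node_events)
for restart, event in pairs:
    print(f"{restart.service} restarted on {restart.node} after {event.kind}")
```

The join key is the node, not the service, which is why a per-service dashboard has no natural place to put this result.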

The topology model as the platform engineer's primary tool

If SRE tooling centers on the service as the primary entity, platform engineering tooling should center on the topology: the graph of services, nodes, namespaces, network policies, and resource pools that constitutes the platform. When something goes wrong, the first question is where in the topology it is — which then tells you whose problem it is (platform team vs. individual service team) and what the blast radius is.

A topology-first view shows you: service A is having a problem; service B, which depends on service A, is starting to degrade as a result; and both A and B are running on the same node group that had a node recycling event 12 minutes ago. This view isn't available if your observability system is organized per-service rather than per-topology.
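As a rough sketch of what a topology-first query might look like, the snippet below walks a small hand-written graph to estimate blast radius for a node group event. The service names, node group names, and the two dictionaries are hypothetical; a real model would be derived from live data rather than typed in.

```python
from collections import deque

# Hypothetical topology: caller -> services it depends on.
depends_on = {
    "service-b": {"service-a"},
    "service-c": {"service-b"},
    "service-d": set(),
}
# Hypothetical placement: service -> node group it runs on.
runs_on = {
    "service-a": "node-group-1",
    "service-b": "node-group-1",
    "service-c": "node-group-2",
    "service-d": "node-group-2",
}


def blast_radius(node_group):
    """Return (services running on the node group,
    services that depend on them directly or transitively)."""
    direct = {svc for svc, group in runs_on.items() if group == node_group}

    # Invert the dependency edges so we can walk from a service to its callers.
    dependents = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)

    seen = set(direct)
    queue = deque(direct)
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return direct, seen - direct


direct, indirect = blast_radius("node-group-1")
print("directly affected:", sorted(direct))
print("affected via dependencies:", sorted(indirect))
```

The breadth-first walk over inverted dependency edges is one simple choice; it answers "who could be hurt by this" without saying anything about whether they actually are.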

Building and maintaining a runtime topology model is hard. It requires pulling from multiple sources: the Kubernetes API (Service and Endpoints objects), service mesh topology (Istio / Linkerd, if you're running one), network policy graphs, deployment dependency specifications, and trace-derived call graphs. Static architecture diagrams go stale within weeks in an active microservices environment. The topology needs to be derived from live traffic data, not from human-maintained documentation.
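One of those sources, the trace-derived call graph, can be sketched in a few lines. The span tuples below are a simplified, hypothetical shape (span ID, parent span ID, service name) and not the format of any specific tracing backend.

```python
from collections import Counter

# Hypothetical simplified spans: (span_id, parent_span_id, service_name).
spans = [
    ("1", None, "gateway"),
    ("2", "1", "checkout"),
    ("3", "2", "checkout"),    # internal span, same service as its parent
    ("4", "3", "payments"),
    ("5", "2", "inventory"),
]


def call_edges(spans):
    """Count caller -> callee edges, keeping only parent/child span
    pairs that cross a service boundary."""
    service_of = {span_id: service for span_id, _, service in spans}
    edges = Counter()
    for _, parent_id, service in spans:
        parent_service = service_of.get(parent_id)
        if parent_service is not None and parent_service != service:
            edges[(parent_service, service)] += 1
    return edges


edges = call_edges(spans)
for (caller, callee), count in sorted(edges.items()):
    print(f"{caller} -> {callee} ({count} call(s))")
```

Because the edges come from observed traffic, the graph reflects what services actually call rather than what a diagram says they call.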

Incident ownership: platform vs. service

One of the most expensive parts of platform incidents is the handoff problem: is this a platform issue or a service issue? Platform incidents that look like service issues — and vice versa — burn significant on-call time as service teams and the platform team each investigate on their side before realizing they're looking at the same root cause from different angles.

A well-known scenario in multi-team environments: a DNS resolution issue at the cluster level causes intermittent failures in a dozen services. Each service team sees their own service failing and starts their standard incident response. Half of them page the database team (it looks like a connection issue). The other half open tickets against the API they depend on. The platform engineer eventually figures out the DNS issue, but by that point four teams have woken up and two incidents have been incorrectly scoped.

The tooling problem here is that there's no shared view that simultaneously shows "12 services are experiencing failures" and "these 12 services share a DNS configuration in the same cluster." That view requires cross-service aggregation with infrastructure topology context — which is exactly what platform-first observability should provide, and what service-centric tooling doesn't.
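A minimal sketch of that shared view: given the set of failing services and an inventory of the infrastructure each one sits on, find the attributes they all have in common. The inventory keys and values here are invented for illustration.

```python
from collections import Counter

# Hypothetical inventory: service -> infrastructure attributes.
service_infra = {
    "checkout": {"cluster": "prod-1", "dns": "coredns-prod-1", "node_group": "ng-a"},
    "search":   {"cluster": "prod-1", "dns": "coredns-prod-1", "node_group": "ng-b"},
    "profile":  {"cluster": "prod-1", "dns": "coredns-prod-1", "node_group": "ng-c"},
    "billing":  {"cluster": "prod-2", "dns": "coredns-prod-2", "node_group": "ng-d"},
}


def common_factors(failing, inventory):
    """Map each (attribute, value) shared by every failing service to the
    healthy services that also have it. An empty list means the attribute
    is specific to the failing set, which makes it a stronger suspect."""
    counts = Counter()
    for svc in failing:
        for key, value in inventory[svc].items():
            counts[(key, value)] += 1

    result = {}
    for (key, value), n in counts.items():
        if n == len(failing):
            result[(key, value)] = [
                svc for svc, attrs in inventory.items()
                if svc not in failing and attrs.get(key) == value
            ]
    return result


failing = ["checkout", "search", "profile"]
factors = common_factors(failing, service_infra)
for (key, value), healthy in factors.items():
    print(f"all failing services share {key}={value}; healthy services with it: {healthy}")
```

A shared attribute is a lead, not a diagnosis, but it is enough to route the first page to the platform team instead of to four service teams.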

What platform engineers actually need from observability

Platform engineering teams need observability tooling built around three capabilities that SRE tooling underserves. First: cross-service impact aggregation — the ability to see how many services and users are affected by a single infrastructure event, without requiring a per-service dashboard review. Second: infrastructure-to-workload correlation — connecting control plane health, node events, and networking conditions to service-level symptoms. Third: change attribution at the platform layer — when a Kubernetes version upgrade, a CNI plugin change, or a node pool configuration update happens, what services were affected and how?
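The third capability, change attribution, can be approximated with a simple before/after comparison. The sketch below assumes per-minute error rates per service and a known change time; the sample numbers, the `change_minute` value, and the 0.03 threshold are hypothetical, and a shift that lines up with a change suggests a cause without proving one.

```python
from statistics import mean

# Hypothetical per-minute error rates (fraction of failed requests).
error_rates = {
    "checkout": [0.01, 0.01, 0.02, 0.09, 0.11, 0.10],
    "search":   [0.02, 0.02, 0.02, 0.02, 0.03, 0.02],
}
change_minute = 3  # hypothetical: a platform change rolled out at index 3


def affected_by_change(series_by_service, change_index, min_increase=0.03):
    """Return services whose mean error rate rose by at least
    `min_increase` after the change, with the size of the increase."""
    affected = {}
    for svc, series in series_by_service.items():
        before = mean(series[:change_index])
        after = mean(series[change_index:])
        if after - before >= min_increase:
            affected[svc] = round(after - before, 3)
    return affected


affected = affected_by_change(error_rates, change_minute)
for svc, increase in affected.items():
    print(f"{svc}: error rate up by {increase} after the change")
```

Running the same comparison across every service at once is the point: the platform team gets one list per change instead of one dashboard per service.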

We're not saying SRE tooling is wrong for SREs. We're saying platform engineers using the same tools are spending too much time working around a mental model mismatch. The tools were built for a specific job — the SRE's job — and that job is different from the platform engineer's job in ways that matter when an incident is active and every minute of context-gathering time costs user-facing availability.

The platform engineering discipline has matured enough to warrant tools designed around its specific questions. The "internal developer platform" conversation has dominated the space for the past few years, focused on developer experience and self-service. The observability side of platform engineering — how does the platform team see and diagnose what's happening in the infrastructure layer — deserves the same focused attention.
