There is a question every platform engineering team should be able to answer but most cannot: is our platform actually healthy? Not "is it up?" Not "are the dashboards green?" But genuinely healthy. Is it delivering value to the engineers who use it every day? Is it making their work better? Most teams reach for observability to answer this, and that is where the problem starts.

Observability is powerful. It tells you what is happening inside your systems. But it is an inward-facing discipline, built to answer "why is this breaking?" not "why is nobody using this?" or "is this tool actually making my developers faster?" Relying on technical signals alone creates what I call a ghost town platform: a technically flawless environment that developers quietly bypass in favor of custom scripts or manual workarounds because the golden paths are too rigid, too confusing, or just not worth the trouble. The system looks healthy. The platform is not.

In platform engineering, health does not equal availability. Health equals utility.

Adoption and usage

A healthy platform is a popular one. The most honest signal of platform health is whether developers are choosing to use it. There is a meaningful difference between usage and adoption: usage can be mandated; adoption has to be earned. If your platform is healthy, developers take the paved path because it is faster and safer, not because you blocked the alternatives.

The metrics that matter here are things like onboarding time (how long from a new engineer's first day to their first deployment?), self-service rate (what percentage of infrastructure requests happen without someone opening a ticket?), and whether teams that were never told to use the platform are showing up anyway. When adoption is low, the instinct is to blame communication. Usually, it is a product problem.
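These first two metrics fall out of a few event records. As a sketch (the field names and values here are illustrative assumptions, not any particular platform's schema):

```python
from datetime import date

# Hypothetical records; field names are illustrative, not a real schema.
engineers = [
    {"name": "dev-a", "start_date": date(2025, 3, 3), "first_deploy": date(2025, 3, 6)},
    {"name": "dev-b", "start_date": date(2025, 3, 10), "first_deploy": date(2025, 3, 24)},
]
infra_requests = [
    {"id": 1, "fulfilled_via": "self_service"},
    {"id": 2, "fulfilled_via": "ticket"},
    {"id": 3, "fulfilled_via": "self_service"},
]

# Onboarding time: first day to first deployment, per engineer.
onboarding_days = [(e["first_deploy"] - e["start_date"]).days for e in engineers]
avg_onboarding = sum(onboarding_days) / len(onboarding_days)

# Self-service rate: share of infrastructure requests needing no ticket.
self_service = sum(1 for r in infra_requests if r["fulfilled_via"] == "self_service")
self_service_rate = self_service / len(infra_requests)

print(f"avg onboarding: {avg_onboarding:.1f} days")   # (3 + 14) / 2 = 8.5 days
print(f"self-service rate: {self_service_rate:.0%}")  # 2 of 3 = 67%
```

However you source the events, the point is that both numbers are trivially computable once the platform emits them, which is itself a useful forcing function.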

Developer experience

High adoption does not always equal high satisfaction. Developers may use a tool because they have to, not because it works well for them. This distinction matters, and the only way to see it is to ask.

A simple developer NPS survey run quarterly, with one open text field for comments, will surface things no dashboard ever will. That open text field is where you find out about the workflow that costs three engineers thirty minutes every week, or the error message that has confused four separate teams. The 2025 DORA research found that the platform capability most correlated with a positive developer experience is giving developers clear feedback on the outcome of their tasks. Not the fanciest tooling. Just clarity. If your platform leaves people guessing when something goes wrong, that is a health problem, regardless of your uptime numbers.
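The NPS arithmetic itself is standard: the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (0 through 6), on a 0-10 scale. A minimal sketch, with made-up survey responses:

```python
def developer_nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))

# Illustrative quarterly responses: 4 promoters, 2 detractors of 8 replies.
print(developer_nps([10, 9, 8, 7, 6, 9, 3, 10]))  # (4 - 2) / 8 = 25
```

The number is the easy part; the open text field next to it is where the actual signal lives.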

Reliability and stability

Technical reliability still matters. It is just not the whole story. The instinct is to track uptime, but uptime is a floor, not a ceiling. What matters more is how the platform behaves when things go wrong. A platform that recovers in twenty minutes is healthier than one that rarely breaks but takes four hours to recover when it does.

Service Level Objectives framed from the developer's perspective (99% of deployments succeed on the first try, for example) are more useful than infrastructure-level SLOs. Change failure rate tells you whether the platform's guardrails are actually catching problems early. And Mean Time to Recovery tells you how quickly the platform helps engineers get back on their feet. Together, these metrics describe reliability as a developer experience, not just a system property.
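All three of these reliability signals reduce to simple arithmetic over deployment and incident records. The record shapes and numbers below are assumptions for illustration:

```python
# Hypothetical deployment and incident records; schema is illustrative.
deployments = [
    {"id": "d1", "first_try_ok": True,  "caused_incident": False},
    {"id": "d2", "first_try_ok": False, "caused_incident": True},
    {"id": "d3", "first_try_ok": True,  "caused_incident": False},
    {"id": "d4", "first_try_ok": True,  "caused_incident": False},
]
incident_recovery_minutes = [22, 94]  # time to restore service, per incident

# Developer-facing SLO: deployments that succeed on the first try.
first_try_rate = sum(d["first_try_ok"] for d in deployments) / len(deployments)

# Change failure rate: deployments that led to an incident.
change_failure_rate = sum(d["caused_incident"] for d in deployments) / len(deployments)

# Mean Time to Recovery across incidents.
mttr = sum(incident_recovery_minutes) / len(incident_recovery_minutes)

print(f"first-try success: {first_try_rate:.0%}")         # 3/4 = 75%
print(f"change failure rate: {change_failure_rate:.0%}")  # 1/4 = 25%
print(f"MTTR: {mttr:.0f} min")                            # (22 + 94) / 2 = 58 min
```

Framing the SLO as "deployments that succeed on the first try" rather than "control plane uptime" is the whole trick: the denominator is developer attempts, not server hours.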

Efficiency and toil reduction

This is perhaps the clearest health signal of all, and the one most directly connected to business value. Toil is the repetitive, manual work that scales with growth and kills innovation. Provisioning infrastructure by hand. Running the same runbook every deploy. Chasing approvals for access that should be self-service.

A healthy platform absorbs toil over time. If you track the ratio of engineering time spent on unplanned work, tickets, and maintenance versus actual product work, you get one of the most honest reads available on whether the platform is doing its job. DORA metrics like lead time for changes and deployment frequency tell a complementary story about velocity. And when you can show that toil dropped from 35% of engineering capacity to 12% over two quarters, you have a leadership conversation, not just a status update.
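The toil ratio itself is simple division over time-allocation data, however you collect it (ticket trackers, sprint categories, surveys). The categories and hour counts here are made up for illustration:

```python
# Hypothetical engineer-hours per quarter; numbers are illustrative.
hours = {
    "unplanned_work": 420,
    "tickets": 260,
    "maintenance": 320,
    "product_work": 2000,
}

# Toil ratio: non-product work as a share of total engineering time.
toil = hours["unplanned_work"] + hours["tickets"] + hours["maintenance"]
total = sum(hours.values())
toil_ratio = toil / total

print(f"toil: {toil_ratio:.0%} of engineering capacity")  # 1000 / 3000 = 33%
```

Tracked quarter over quarter, this is the trend line that turns a platform status update into the leadership conversation described above.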

From monitoring to management

Making this shift requires treating the platform as a product. That means maintaining a roadmap based on developer feedback, not just infrastructure priorities. It means publishing platform health reports that combine technical telemetry with human sentiment. And it means having someone whose job it is to ask, regularly: are we actually getting better at serving the engineers who depend on us?

Nearly 30% of platform teams still do not measure their success in any formal way, according to the 2025 State of Platform Engineering Report. Platform initiatives that cannot quantify their impact often face defunding within twelve to eighteen months. Observability data alone does not make a compelling leadership story. But "we reduced onboarding time from two weeks to three days" does. "Teams on our golden paths ship 40% faster than those that are not" does.

Observability is the what and why of your system's internal state. Platform health is the so what of your organization's productivity. Track adoption, developer experience, reliability, and toil alongside your technical metrics, and you will have both the tools to improve and the language to prove it.