Platform engineering has a dual mandate that most teams haven't fully thought through. You need to maintain operational visibility into your shared infrastructure (Kubernetes clusters, CI/CD pipelines, service meshes, and so on) while simultaneously enabling developers to monitor their own applications without friction. This isn't about choosing between infrastructure observation and developer enablement. It's about recognizing that modern distributed systems have made both responsibilities equally critical.

The challenge is clear: an unobservable platform creates poor developer experience, wastes engineering time, and forces constant guessing during incidents. When you build self-service platforms, developers will deploy applications in ways you cannot predict. Traditional centralized monitoring breaks down precisely when it matters most.

Why infrastructure visibility alone fails

Platform engineers face a fundamental problem. You can monitor every metric from your Kubernetes control plane, track every CI/CD pipeline execution, and alert on every infrastructure anomaly, and still have no idea why a developer's application is failing in production.

Self-service platforms create scenarios you cannot anticipate. Developers will configure services, deploy workloads, and integrate dependencies in combinations that your centralized dashboards were never designed to capture. You're building gates, not castles. Once developers pass through those gates, they need their own visibility.

When centralized monitoring becomes a bottleneck

Traditional monitoring assumes predictability. You define what to watch, set thresholds, and wait for alerts. This works when you control the entire stack and know exactly what's running.

But platform engineering inverts this model. You provide capabilities (compute, storage, networking, deployment pipelines) and developers compose them in unpredictable ways. Your infrastructure might be healthy while their applications are failing due to misconfigured service discovery, incorrect environment variables, or unexpected latency in external dependencies.

The limitations are structural:

  • Coverage gaps: You can't instrument what you don't know exists
  • Context loss: Infrastructure metrics don't explain application behavior
  • Support burden: Every incident becomes a platform team investigation

This is why the dual mandate exists. You need infrastructure observation to maintain the platform itself. Developers need application monitoring to operate what they build on top of it.

Observability as a platform product: Shifting from tools to capabilities

The solution, however, isn't about which monitoring tooling you use; it's about how you use it. That means treating observability as a platform capability you build for your internal customers: your developers.

When you adopt an Observability as a Product mindset, you naturally fulfill both responsibilities. You maintain operational visibility of shared infrastructure while delivering self-service capabilities that empower developers. This isn't about deploying a vendor's observability platform and calling it done. It's about designing, building, and operating observability as a feature of your platform, complete with documentation, self-service interfaces, and clear support models.

This approach delivers three critical outcomes:

  • Reduced cognitive load: Developers get visibility by default, not through manual instrumentation
  • Faster incident resolution: Teams debug their own applications without waiting for platform support
  • Scalable operations: Your platform team doesn't become a bottleneck for every production issue

We know that observability directly improves developer experience. When developers can quickly understand why their application is behaving unexpectedly, they ship faster and with more confidence. When they can't, they file tickets, schedule meetings, and wait for your team to investigate infrastructure that's probably working fine.

So by building observability directly into the places where developers already work in the platform, you use it to improve their workflows without adding toil.

Building on OpenTelemetry standards

OpenTelemetry provides the vendor-neutral foundation that makes this dual mandate practical. It's not just another telemetry collection tool; it's the standardized approach that lets you observe infrastructure and enable developers simultaneously.

Manual instrumentation doesn't scale. When every team implements its own logging conventions, metric names, and trace propagation, you get custom labels, mismatched field names, broken queries, and useless dashboards. OpenTelemetry's semantic conventions solve this by defining a common vocabulary for logs, metrics, and traces. These standards make telemetry portable, queryable, and reusable across all services.
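
To make that concrete, here is a minimal sketch of standardized resource attributes on a Kubernetes workload. The service names and values are illustrative placeholders; the attribute keys (service.name, service.namespace, deployment.environment, service.version) come from OpenTelemetry's semantic conventions, and OTEL_SERVICE_NAME / OTEL_RESOURCE_ATTRIBUTES are the standard SDK environment variables for setting them.

```yaml
# Illustrative container env fragment: every service ships the same
# standardized resource attributes, so queries and dashboards work across
# teams without per-team naming agreements. Values are placeholders.
env:
  - name: OTEL_SERVICE_NAME          # becomes the service.name resource attribute
    value: checkout-api
  - name: OTEL_RESOURCE_ATTRIBUTES   # comma-separated key=value pairs
    value: service.namespace=payments,deployment.environment=production,service.version=1.4.2
```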

The OpenTelemetry Collector becomes your telemetry router and policy engine. From this central point, you can:

  • Sample high-volume traces to control costs without changing application code
  • Enforce compliance by redacting sensitive fields before data leaves your infrastructure
  • Route different telemetry types to different backends based on team needs

This centralized control is how you maintain infrastructure visibility while enabling developer autonomy. You set the standards, automate the collection, and let developers query their own data.
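
As a rough sketch of that control point, the following Collector configuration (assuming the contrib distribution, which provides the probabilistic_sampler and attributes processors; endpoints, backends, and the redacted attribute are placeholders rather than recommendations) samples traces, strips a sensitive field, and routes traces and metrics to different backends:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}

processors:
  # Keep roughly 10% of traces to control volume without touching application code
  probabilistic_sampler:
    sampling_percentage: 10
  # Drop a sensitive attribute before telemetry leaves your infrastructure
  attributes/redact:
    actions:
      - key: user.email
        action: delete
  batch: {}

exporters:
  otlp/traces-backend:
    endpoint: traces.example.internal:4317
  prometheusremotewrite:
    endpoint: https://metrics.example.internal/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, attributes/redact, batch]
      exporters: [otlp/traces-backend]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

Because all of this lives in the Collector, you can tighten sampling, add a redaction rule, or swap a backend without asking a single team to redeploy.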

Enabling developer self-service through practical implementation

You can't just tell developers "observability is available". You must make it effortless to use.

Five essential practices to make observability scalable:

  1. Automate instrumentation at scale: Use the OpenTelemetry Operator to inject auto-instrumentation into workloads (see the sketch after this list). Developers get traces, metrics, and logs without modifying code.
  2. Standardize telemetry collection: Enforce semantic conventions across all services. This makes telemetry consistent and queryable without manual coordination.
  3. Centralize control with the Collector: Route, sample, and enrich telemetry from a single control plane. Change backends, adjust sampling rates, or add enrichment without touching applications.
  4. Implement GitOps for configurations: Treat dashboards and alerts as code. Use tools like Perses to define dashboards in YAML, version them in Git, and deploy via CI/CD. This removes manual UI edits and makes configurations portable.
  5. Provide paved paths with smart defaults: Build opinionated golden paths that deliver value immediately, such as auto-instrumentation, standard dashboards, and pre-configured alerts. Give developers a head start, not restrictions.
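
As referenced in practice 1, here is a minimal sketch of Operator-driven auto-instrumentation. It assumes the OpenTelemetry Operator is installed in the cluster; the names, namespace, endpoint, image, and sampling ratio are placeholders.

```yaml
# Platform-owned defaults: where telemetry goes, how context propagates,
# and how traces are sampled.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: platform-defaults
  namespace: observability
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.10"
---
# Developer-owned workload: a single annotation on the pod template opts in,
# and the Operator injects the language agent. No application code changes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
      annotations:
        instrumentation.opentelemetry.io/inject-java: "observability/platform-defaults"
    spec:
      containers:
        - name: app
          image: registry.example.internal/checkout-api:1.4.2
```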

The goal is to remove toil. When developers deploy a new service, they should automatically get basic observability. They can then customize as needed, but the default experience must work out of the box. This default experience might include things like request rates, error rates, latency distributions, and trace sampling.
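
The article doesn't prescribe an alerting backend, but to illustrate what a pre-configured default could look like, here is a sketch assuming a Prometheus-compatible stack managed through the Prometheus Operator's PrometheusRule resource. The metric and label names are illustrative of OpenTelemetry HTTP semantic conventions as they commonly appear after export to Prometheus; the exact names depend on your pipeline.

```yaml
# Hypothetical default alert shipped with every new service.
# Metric and label names vary by exporter configuration; treat the
# expression as a shape, not a contract.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-api-default-alerts
spec:
  groups:
    - name: golden-signals
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_server_request_duration_seconds_count{service_name="checkout-api",http_response_status_code=~"5.."}[5m]))
              /
            sum(rate(http_server_request_duration_seconds_count{service_name="checkout-api"}[5m])) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "checkout-api 5xx error rate above 5% for 10 minutes"
```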

Understanding the developer journey and pain points

Platform teams often build observability capabilities without understanding where developers actually need them. The result is sophisticated tooling that nobody uses.

Map the complete developer workflow from code commit to production deployment. Where do developers currently struggle? Testing environments that don't match production? Deployment verification that requires manual checks? Incident response that involves guessing which service is failing?

Observability delivers the highest value at specific pain points:

  • Pre-deployment verification: Can developers validate that their changes work before merging?
  • Deployment confidence: Can they confirm a rollout succeeded without waiting for user reports?
  • Incident response: Can they identify the failing component and understand why it's failing?

You also need to gather feedback continuously. Surveys are useful, but nothing beats speaking directly to your internal customers. Ask developers what questions they can't answer today. Ask them to walk through their last production incident. The gaps in their answers reveal where observability capabilities would have helped.

When you involve developers early in building observability features, they become advocates. They promote the platform, help other teams adopt it, and significantly reduce your support burden. This isn't just about making them feel good; it's about leveraging their expertise to build capabilities they actually want to use.

Measuring success: Business outcomes and implementation roadmap

Observability isn't a cost center when implemented correctly. It's a capability multiplier that improves reliability, accelerates incident resolution, and maximizes developer productivity.

Track these specific metrics to demonstrate value:

  • Mean time to resolution (MTTR): How quickly do teams identify and fix production issues? Effective observability should reduce MTTR significantly within six months.
  • Developer productivity indicators: How much time do developers spend debugging versus building features? Track support ticket volume and time spent in incident response.
  • Platform adoption rates: Are developers actually using the observability capabilities you provide? Low adoption indicates poor developer experience, not lack of need.

The business case is straightforward. When developers can debug their own applications, your platform team stops being a bottleneck. When incidents resolve faster, you reduce revenue impact and customer frustration. When telemetry is standardized, you avoid the hidden costs of tool proliferation and data silos.

Common implementation challenges and solutions:

  • Developer adoption resistance: Developers won't adopt capabilities that add friction. Make observability automatic, not optional. Auto-instrumentation removes the adoption barrier.
  • Tool proliferation: Teams choose their own observability vendors, creating data silos. Standardize on OpenTelemetry for collection while allowing flexibility in backends.
  • Cost management: Uncontrolled telemetry volume drives up costs. Use the Collector to implement sampling, filtering, and routing policies centrally (see the sketch after this list).
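
As referenced above, here is a sketch of what central cost controls can look like in the Collector, assuming the contrib distribution's tail_sampling processor (thresholds and percentages are placeholders). Unlike the simple probabilistic sampling shown earlier, tail-based sampling keeps the traces that matter most: every error trace, every slow trace, and only a small sample of everything else.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # wait for a trace's spans to arrive before deciding
    policies:
      - name: keep-errors     # always keep traces that contain errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow       # always keep traces slower than 2s
        type: latency
        latency:
          threshold_ms: 2000
      - name: sample-the-rest # keep 5% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Note that tail-based sampling needs all spans of a trace to reach the same Collector instance, so a scaled-out deployment typically puts a trace-aware load-balancing tier in front of the sampling Collectors.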

Getting started: Phased implementation approach

Don't attempt to solve everything simultaneously. Implement the dual mandate in phases with clear milestones.

Phase 1 (Months 1-3): Foundation
Deploy the OpenTelemetry Collector and establish semantic conventions. Start with infrastructure observation: Instrument your Kubernetes clusters, CI/CD pipelines, and shared services. This gives you operational visibility while you build developer-facing capabilities.

Phase 2 (Months 4-6): Developer enablement
Implement auto-instrumentation for common application frameworks. Provide standard dashboards and basic alerting. Focus on one or two high-impact use cases like deployment verification or incident response, rather than trying to cover everything.

Phase 3 (Months 7-12): Optimization and scale
Gather developer feedback and iterate on capabilities. Implement cost controls through sampling and routing policies. Expand coverage to additional frameworks and use cases based on actual developer needs.

Success criteria for each phase:

  • Phase 1: Infrastructure telemetry flowing to centralized backends, platform team using data for operational decisions
  • Phase 2: Development teams using auto-instrumentation, a measurable reduction in support tickets
  • Phase 3: MTTR reduced measurably, developers reporting improved debugging experience, observability costs under control

Build internal consensus by demonstrating value incrementally. Start with infrastructure observation to prove the technical foundation works. Then enable one development team as a pilot, measure their outcomes, and use those results to drive broader adoption.

If you want to learn more, check out the Observability for Platform Engineering course we created with Dash0, and review the Observability for Platform Engineers whitepaper to go much deeper.

Frequently asked questions

What's the difference between monitoring and observability?
Monitoring tells you something is wrong. Observability explains why it's broken by letting you ask arbitrary questions of your system's behavior.

Why can't developers just use existing monitoring tools?
Existing tools often require manual instrumentation, lack standardization, and don't integrate with platform workflows. Developers need automated, consistent visibility.

How does OpenTelemetry reduce costs?
OpenTelemetry's Collector lets you sample, filter, and route telemetry centrally, controlling volume and costs without changing application code.

What if developers don't adopt observability capabilities?
Poor adoption indicates friction in the developer experience. Focus on auto-instrumentation and smart defaults that work without manual effort.

Join the Platform Engineering community and connect with peers on Slack.