You're managing a Kubernetes platform with dozens of services, and every team instruments differently. Custom labels. Mismatched field names. Broken queries. When an incident hits, you're stitching together data from three different tools, guessing which service is failing.
This is the standardization problem that OpenTelemetry solves. OpenTelemetry is the vendor-neutral framework that separates telemetry generation from analysis, giving you control over observability infrastructure while enabling developer self-service. As the second-largest CNCF project by contributor count, OpenTelemetry had more PRs and issues than Kubernetes itself last year. This is the Rails moment for observability.
What OpenTelemetry actually is
OpenTelemetry is a vendor-neutral observability framework that standardizes how you collect, process, and export telemetry data. Think of it as the layer between your applications and your observability backends - it handles instrumentation, collection, and routing while you choose where to analyze the data.
The framework provides three core components:
SDKs and APIs for instrumenting applications across languages (Go, Java, Python, Node, Ruby). These generate logs, metrics, and traces using consistent conventions, so telemetry from a Java service looks structurally identical to telemetry from a Go service.
Semantic conventions that define the common vocabulary for telemetry. Instead of one team using http_status while another uses response.code, everyone uses http.response.status_code. This standardization makes telemetry portable and queryable across all services.
The OpenTelemetry Collector acts as your telemetry router and policy engine. It receives data from instrumented applications, processes it (sampling, filtering, enriching), and exports it to your chosen backends - Prometheus, Jaeger, or commercial platforms like DataDog or Dash0.
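To make the router role concrete, here is a minimal Collector configuration sketch. The endpoints and backend URL are placeholder assumptions, not recommendations: it accepts OTLP from instrumented services, batches it, exposes metrics for Prometheus to scrape, and forwards traces to an OTLP-compatible backend.

```yaml
# Minimal Collector pipeline: receive OTLP, batch, export.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  # Expose metrics for an existing Prometheus server to scrape.
  prometheus:
    endpoint: 0.0.0.0:8889
  # Forward traces to any OTLP-compatible backend (placeholder URL).
  otlphttp:
    endpoint: https://otlp.example-backend.com

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Swapping or adding a backend means editing the exporters block - the applications sending OTLP never notice.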
Here's the key insight: OpenTelemetry does for observability what Kubernetes did for infrastructure. It creates a standardization layer that prevents vendor lock-in and eliminates tool fragmentation. You instrument once, then choose your analysis tools later.
The three pillars - logs, metrics, and traces - aren't isolated silos anymore. OpenTelemetry treats them as interconnected signals that collectively form a complete narrative of system behavior. Resource Attributes like service.name, cloud.region, and k8s.pod.name provide the semantic glue that links these signals together, enabling the cross-signal correlation that dramatically reduces incident resolution time.
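Much of that glue can be attached centrally rather than by each team. A sketch, assuming the Collector's k8sattributes and resource processors from the contrib distribution, with a placeholder region value:

```yaml
processors:
  # Look up Kubernetes metadata for incoming telemetry and attach it
  # as Resource Attributes (k8s.namespace.name, k8s.pod.name, ...).
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
  # Stamp environment-wide attributes the workloads themselves don't know.
  resource:
    attributes:
      - key: cloud.region
        value: eu-west-1   # placeholder
        action: upsert
```

With service.name set by the SDK and these attributes added at the edge, a trace, its logs, and the pod's metrics all carry the same identifiers - which is what makes cross-signal queries possible.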
Why platform engineers should care
Platform engineers face a dual mandate: observe infrastructure health while giving developers self-service observability. You need visibility into Kubernetes clusters, cloud resources, and CI/CD pipelines. Simultaneously, you must provide developers with effortless telemetry for their applications.
When implemented correctly, observability transforms from operational overhead into a capability multiplier. Teams using standardized telemetry see up to 44% faster incident resolution by eliminating conflicting labels and enabling seamless signal correlation.
The standardization problem manifests in three ways. First, custom labels and mismatched field names break queries across services. Second, tool proliferation creates data silos where metrics live in one system, traces in another, and logs in a third. Third, manual instrumentation doesn't scale - every new service requires custom setup, and each team invents its own conventions.
Cross-signal correlation is the superpower that OpenTelemetry enables. When a latency spike triggers an alert, you navigate immediately from the metric to the specific trace showing the slow downstream service, then to correlated logs revealing the responsible deployment change. This workflow - metric alert → specific trace → correlated logs - is only possible when shared Resource Attributes and trace context propagation link your telemetry together.
As Michele Mancioppi, Head of Product at Dash0, explains: "To use telemetry effectively to achieve observability, you really need to understand it. You need to know what that telemetry represents, from which system it's coming, and how different pieces of telemetry are related to each other."
Without standardization, you're managing complexity instead of reducing it. With OpenTelemetry, you establish the foundation for observability at scale.
Architecture: How the pieces fit together
OpenTelemetry's architecture centers on three components that work together to standardize telemetry collection and routing.
SDKs and instrumentation generate telemetry from your applications. Auto-instrumentation libraries inject observability into common frameworks without code changes - a Java Spring application gets traces and many metrics automatically when you add the OpenTelemetry Java agent. Manual instrumentation gives you fine-grained control for custom business logic.
Semantic conventions define the standardized vocabulary for telemetry attributes. When every service uses http.method, http.route, and http.response.status_code, your queries work across all services. These conventions cover HTTP, databases, messaging systems, and infrastructure components, ensuring consistency regardless of language or framework.
The OpenTelemetry Collector is your central control point. It receives telemetry via OTLP (the OpenTelemetry Protocol), processes it through configurable pipelines, and exports it to multiple backends simultaneously.
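The conventions also give you a target to normalize toward. For services that still emit legacy names - the http_status example from earlier - a sketch assuming the contrib attributes processor:

```yaml
processors:
  attributes/normalize:
    actions:
      # Copy the legacy key into the conventional one, then drop the original.
      - key: http.response.status_code
        from_attribute: http_status
        action: upsert
      - key: http_status
        action: delete
```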
The Collector's processing capabilities give you centralized control without application code changes - all three are illustrated in the sketch after this list:
- Sampling for cost control - Apply tail-based sampling to keep only interesting traces (errors, slow requests) while dropping routine traffic
- Field redaction for compliance - Strip sensitive data like credit card numbers or personal information before telemetry leaves your infrastructure (requires explicit processor configuration and regular validation)
- Backend routing - Send high-cardinality metrics to one system, aggregated metrics to another, and traces to a third based on team needs
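A combined sketch of those three capabilities, extending the minimal pipeline shown earlier. The thresholds, attribute keys, and backend URLs are illustrative assumptions; tail_sampling and the exporters shown ship with the contrib Collector distribution:

```yaml
processors:
  # Keep errors and slow requests, plus a 10% sample of routine traffic.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  # Scrub sensitive fields before telemetry leaves your infrastructure.
  attributes/redact:
    actions:
      - key: credit_card_number   # illustrative attribute key
        action: delete
      - key: user.email           # illustrative attribute key
        action: hash

exporters:
  otlphttp/traces:
    endpoint: https://traces.example-backend.com                 # placeholder
  prometheusremotewrite:
    endpoint: https://metrics.example-backend.com/api/v1/write   # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, attributes/redact, batch]
      exporters: [otlphttp/traces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

Routing per team or per signal type works the same way: add another pipeline with its own processors and exporters.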
This architecture separates concerns cleanly. Developers instrument applications using standard SDKs. The Collector handles routing, sampling, and policy enforcement. Platform engineers control the infrastructure without becoming a bottleneck for every instrumentation change.
The Collector also enables gradual migration. You can start by routing telemetry to your existing backends, then switch to new tools later without re-instrumenting applications. This flexibility is why OpenTelemetry prevents vendor lock-in - your instrumentation investment is portable across any observability platform that supports OTLP.
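In practice, a migration can be as small as adding a second exporter, dual-shipping while you validate the new backend, then deleting the old entry - a sketch with placeholder vendor endpoints, again extending the earlier pipeline:

```yaml
exporters:
  otlphttp/current:
    endpoint: https://old-vendor.example.com   # placeholder
  otlphttp/candidate:
    endpoint: https://new-vendor.example.com   # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      # Dual-ship during evaluation; remove one exporter to complete the switch.
      exporters: [otlphttp/current, otlphttp/candidate]
```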
Getting started
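On Kubernetes, a common first step is to install the OpenTelemetry Operator and let it inject auto-instrumentation. A sketch assuming the Operator is already running in the cluster, with placeholder names and a 25% sampling ratio:

```yaml
# An Instrumentation resource tells the Operator how to auto-instrument pods.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: observability                    # placeholder namespace
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317   # your Collector's OTLP endpoint
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
---
# Opt a workload in by annotating its pod template (Java agent injection shown).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                              # placeholder workload
spec:
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: checkout
          image: example/checkout:1.0.0       # placeholder image
```

Pods created after the annotation is applied start emitting traces to the Collector with no application code changes.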
For more comprehensive guidance, check out the Observability for Platform Engineering course created with Dash0, and review the Observability for Platform Engineers whitepaper for deeper technical details on platform engineering's dual observability mandate.
Frequently asked questions
Is OpenTelemetry free and open source?
Yes. OpenTelemetry is a CNCF project with Apache 2.0 licensing. All SDKs, APIs, and the Collector are free to use without vendor restrictions.
How difficult is OpenTelemetry to implement for platform teams?
Auto-instrumentation via the Operator makes initial implementation straightforward. Complexity comes from organizational adoption and optimizing sampling policies, not from the technology itself.
What's the difference between OpenTelemetry and Prometheus?
Prometheus is a metrics collection and storage system. OpenTelemetry is a broader framework that generates logs, metrics, and traces, and can export metrics to Prometheus.
How does OpenTelemetry reduce vendor lock-in?
By standardizing instrumentation separately from analysis. You instrument applications once using OpenTelemetry SDKs, then route telemetry to any backend that supports OTLP - switching vendors requires only Collector configuration changes.
What resources are needed to implement OpenTelemetry at scale?
One platform engineer can deploy the Collector and Operator in weeks. Scaling requires documentation, golden path examples, and ongoing support - plan for 20-30% of one engineer's time after initial deployment.
Join the Platform Engineering community and connect with peers on Slack.






