Your observability bill is rising faster than your traffic. And the day you try to switch vendors, you’ll discover you’re locked in by tech - not contracts.
If you’ve ever supported observability across dozens of teams, you’ve seen this pattern repeat. At five teams, “best practices” plus a shared dashboard folder kind of works. Conventions spread organically, and inconsistencies are fixable with a quick nudge.
At fifty teams, that breaks. You become the bottleneck. Telemetry becomes inconsistent. Costs become political. This is the classic failure mode of manual governance: what works for a few teams collapses under many.
That’s the moment we stopped treating observability as “something teams implement” and started treating it as a capability the platform ships.
This is the story of how we moved from “teams doing observability” to self-service observability by default, by embedding OpenTelemetry into a NestJS-based Platform SDK used by 90+ internal teams - with cost guardrails, standardized semantics, and vendor-neutral export.
This direction isn’t unique. Microsoft has publicly described refactoring Azure’s native observability platform around OpenTelemetry as an industry standard for instrumenting applications and transmitting telemetry. DoorDash has shared how they adopted and tailored OpenTelemetry for custom context propagation at microservices scale, including rollout and security considerations. The industry is converging because the underlying problem is universal: observability can’t remain a per-team craft when systems and organizations scale.
The moment observability “exists”… but stops working
The breaking point usually isn’t missing telemetry. It’s untrustworthy telemetry.
You can’t build a cross-team latency dashboard because one service emits http.route correctly, another emits raw paths, and another emits nothing. Error-rate alerts become noisy because teams define “error” differently. Traces look “fine,” but they aren’t comparable across services because naming and attributes drift over time.
This is worse than having no observability. When teams stop trusting the data, they stop using it. Debugging reverts to guesswork and tribal knowledge - and the platform’s operational maturity quietly slides backward.
Then the cost shows up - quietly at first, then suddenly.
Why costs explode (and why it feels impossible to control)
Here’s the uncomfortable truth: most observability spending isn’t driven by “bad intent.” It’s driven by defaults.
One well-meaning decision can multiply cost dramatically by increasing cardinality and indexed volume. The usual culprits are familiar: putting identifiers into metric labels, capturing raw URL paths instead of normalized routes, or storing full GraphQL documents broadly “just in case.”
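To make the route problem concrete, here is a minimal sketch of normalizing raw paths before they become metric labels. The function name and regex patterns are illustrative, not our SDK's actual API:

```typescript
// Hypothetical helper: collapse high-cardinality path segments into
// stable placeholders so they are safe to use as metric labels.
function normalizeRoute(path: string): string {
  return path
    .split('/')
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ':id'; // numeric IDs
      if (/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(segment)) {
        return ':uuid'; // UUIDs
      }
      return segment;
    })
    .join('/');
}

// '/users/42/orders' and '/users/7/orders' now share a single label value.
```

In a real deployment you would prefer the framework's actual route template (NestJS knows it at handler registration time) over regex guessing; the sketch just shows why one shared normalizer beats fifty per-team conventions.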
So the real pain isn’t “observability is expensive.” The real pain is that observability becomes unpredictable - and unpredictability is what gets budgets killed.
This is exactly why OpenTelemetry’s positioning matters. OpenTelemetry is explicitly designed as a vendor-neutral observability framework for generating, collecting, and exporting traces, metrics, and logs - and it’s widely supported by many vendors. When your telemetry production is standardized, cost control becomes a platform decision (sampling, redaction, cardinality guardrails), not a team-by-team guessing game.
Vendor lock-in doesn’t happen in the contract. It happens in your repo.
Most organizations discover lock-in too late. It doesn’t arrive with a procurement decision - it arrives when teams embed vendor SDKs into every service and build operational workflows around proprietary semantics.
At 90+ teams, “switching vendors” turns into a multi-quarter migration program: rewriting instrumentation, rebuilding dashboards/alerts, and retraining teams, all while still running production.
The platform move is to make the backend a pipeline decision, not a code rewrite. OpenTelemetry exists to make that separation real: standardize the telemetry you produce, then choose where you send and analyze it. Microsoft’s OpenTelemetry direction is such a strong signal because it reflects the same reality at cloud scale: standardization first, backend choice second.
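What that separation looks like in practice is that instrumentation code names no vendor. A minimal wiring sketch with the OpenTelemetry Node SDK (the service name and endpoint are placeholders; this is configuration, not our SDK's internals):

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// The only vendor-specific detail is the OTLP endpoint, supplied by
// environment config. Changing backends is a pipeline change, not a rewrite.
const sdk = new NodeSDK({
  serviceName: 'checkout-service', // placeholder service name
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318/v1/traces',
  }),
});

sdk.start();
```

Point the endpoint at an OpenTelemetry Collector and the backend choice moves entirely out of application code.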
The platform insight: Observability is a policy problem
Policy-as-Code works because it turns “please follow rules” into “the platform enforces the rules.” Observability needs the same treatment.
When teams have unlimited freedom to define naming, attributes, sampling, and capture depth, you don't get flexibility - you get entropy. Platform teams already know the fix: define a contract, ship defaults, enforce guardrails, and provide escape hatches for advanced use cases.
For us, that meant treating observability like a platform policy with four pillars:
- Standards: naming, attributes, error semantics
- Guardrails: cardinality limits, sampling defaults, redaction
- Automation: zero-config wiring, correlation by default
- Auditability: versioned changes, predictable behavior
This structure made it possible to scale observability without scaling platform tickets.
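The four pillars reduce to something teams can read: a single, versioned policy object the SDK ships with. The type and field names below are illustrative, not our real interface - the point is that defaults are data, not tribal knowledge:

```typescript
// Illustrative shape of a platform observability policy.
// The version field is what makes changes auditable.
interface ObservabilityPolicy {
  version: string;
  sampling: { dev: number; stage: number; prod: number };
  graphql: { captureDocument: boolean; tagBy: ('operationName' | 'operationType')[] };
  metricLabels: { allowlist: string[]; blockFreeFormStrings: boolean };
  redactByDefault: boolean;
}

// Assumed defaults for the sketch; actual ratios are a platform decision.
const defaultPolicy: ObservabilityPolicy = {
  version: '1.0.0',
  sampling: { dev: 1.0, stage: 1.0, prod: 0.1 },
  graphql: { captureDocument: false, tagBy: ['operationName', 'operationType'] },
  metricLabels: {
    allowlist: ['http.route', 'http.method', 'status_class'],
    blockFreeFormStrings: true,
  },
  redactByDefault: true,
};
```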
What we built: Self-service observability inside a NestJS Platform SDK
We embedded an Observability capability into our NestJS-based SDK so every service generated or migrated onto it inherited the same behavior.
This wasn’t about adding “observability support.” It was about making observability the default state of a service.
For app teams, the promise was simple: your service ships observable on day one, no boilerplate, no debates. For leadership, the promise was equally clear: cost and consistency become engineered, not accidental.
Out of the box, services got a clean observability contract: standardized REST and GraphQL tracing semantics, golden-signal metrics aligned to platform conventions, logs correlated to traces, and vendor-neutral export through OpenTelemetry. Most teams never needed more than the defaults, and that was the point. The platform should solve the common case so well that customization becomes rare, intentional, and controlled.
The scar that made the system real
Here’s what separates real platform work from slideware: we initially allowed too much richness in emitted data.
Especially in GraphQL, it’s tempting to capture “everything” because it’s useful during debugging. But in a multi-team environment, “capture everything” becomes “everyone captures different things,” and that returns you to inconsistency and rising costs.
We saw three things happen quickly: telemetry became inconsistent again (teams chose different capture levels), cardinality risk increased, and cost stopped tracking traffic growth.
So we rolled back to safer defaults: tag GraphQL by operation name and type (not full documents), normalize routes, restrict attributes behind allowlists, and keep “deep capture” behind explicit, controlled toggles. DoorDash’s context propagation write-up highlights similar realities: standardization, rollout strategy, and security concerns become first-class at scale.
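The safer default can be sketched as a small attribute builder. The function is hypothetical, but the attribute keys are OpenTelemetry's standard GraphQL semantic conventions:

```typescript
// Only explicitly allowlisted extras survive; everything else is dropped.
const GRAPHQL_ATTRIBUTE_ALLOWLIST = new Set([
  'graphql.operation.name',
  'graphql.operation.type',
]);

// Hypothetical span-attribute builder for a GraphQL request.
// Tags by operation name/type only; never includes the full document.
function graphqlSpanAttributes(
  operationName: string,
  operationType: 'query' | 'mutation' | 'subscription',
  extra: Record<string, string> = {},
): Record<string, string> {
  const attrs: Record<string, string> = {
    'graphql.operation.name': operationName,
    'graphql.operation.type': operationType,
  };
  for (const [key, value] of Object.entries(extra)) {
    if (GRAPHQL_ATTRIBUTE_ALLOWLIST.has(key)) attrs[key] = value;
  }
  return attrs;
}
```

"Deep capture" then becomes an explicit expansion of the allowlist behind a controlled toggle, rather than an ambient default.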
That rollback improved the system more than any new feature. It turned “powerful but dangerous” into “useful and sustainable.”
The cost model we used (simple enough to remember)
We needed an internal model that any engineer or leader could understand:
Spend ≈ Volume × Cardinality × Retention × Indexing
Once you view cost through that lens, the job becomes obvious: reduce the multipliers centrally.
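The model is crude by design, but even a back-of-the-envelope sketch shows why cardinality is the multiplier to fear. The numbers below are illustrative, not real billing data:

```typescript
// Rough relative-spend model: Spend ≈ Volume × Cardinality × Retention × Indexing.
// Units don't matter; only the ratios between scenarios do.
function relativeSpend(
  volume: number,
  cardinality: number,
  retention: number,
  indexing: number,
): number {
  return volume * cardinality * retention * indexing;
}

const baseline = relativeSpend(1_000_000, 50, 30, 1); // normalized routes: ~50 label values
const withRawPaths = relativeSpend(1_000_000, 50_000, 30, 1); // raw URLs: ~50k label values

// Same traffic, same retention: 1000× the spend, driven entirely by cardinality.
console.log(withRawPaths / baseline); // 1000
```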
We enforced three platform-level behaviors:
- Cardinality guardrails (normalize routes, block IDs/free-form strings in metric labels by default, allowlist attributes in high-risk surfaces like GraphQL and headers).
- Sampling defaults by environment (generous in dev/stage, tuned in prod, and an “incident mode” switch to temporarily increase sampling without redeploys).
- Redaction by default (privacy and sensitivity treated as a platform responsibility, not a team best-practice checklist).
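The sampling behavior above reduces, in code, to something very small. The ratios and the flag name are assumptions for the sketch:

```typescript
type Environment = 'dev' | 'stage' | 'prod';

// Hypothetical platform defaults: sample generously outside prod,
// and let incident mode raise prod sampling without a redeploy.
function samplingRatio(env: Environment, incidentMode = false): number {
  if (env !== 'prod') return 1.0; // dev/stage: keep everything
  return incidentMode ? 1.0 : 0.1; // prod default, overridable at runtime
}
```

The incident switch was driven by runtime configuration, so raising sampling during an outage never required shipping code.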
This is where platform observability earns trust: cost control becomes invisible to teams, but visible in budgets.
Why this scaled: adoption is a DX problem
We reached 90+ teams because the platform made observability the path of least resistance.
New services were observable by default. Dashboards worked immediately because semantics were consistent. Teams didn't need to learn vendor-specific tooling to be successful; the platform handled the hard parts once and reused the pattern everywhere.
That’s the difference between an observability “program” and an observability capability.
Performance outcomes: how this enabled a 4× TPS story
OpenTelemetry didn’t magically improve throughput. What it did was make performance work repeatable.
With consistent spans, we could identify real latency contributors across services instead of arguing about measurement. With correlated logs, we reduced guesswork during investigations. With standardized metrics, we could compare before/after changes credibly across teams. And because the telemetry contract was stable, performance testing became far more self-service: teams could run tests and interpret results without platform hand-holding.
Observability stopped being reactive and became a performance multiplier, supporting work that contributed to a 4× TPS improvement in key workloads.
Three quotable truths
- If observability isn’t standardized, it isn’t scalable.
- Cost control isn’t finance work - it’s platform design.
- Vendor lock-in is usually written in code, not signed in procurement.
When observability becomes self-service and standardized, teams stop asking “How do I instrument this?” and start asking the only question that matters:
“What should we improve next?”
Disclaimer: Views expressed are as of the date indicated and may change. Unless otherwise noted, the opinions provided are those of the author, and not necessarily those of Fidelity Investments.