The journey to production AI: A practical guide for platform engineers and SREs
May 5, 2026, 7:00 pm CEST, 60 minutes
Most teams are experimenting with AI agents. Few are running them reliably in production. This session breaks down how to move from first agent to trusted, repeatable workflows with real telemetry, strong context, and control over cost, accuracy, and scale.
Speaker
Andre Elizondo
Head of innovation and AI @ Mezmo

Getting an AI agent to work in a demo is easy. Getting it to run reliably at scale - with real telemetry, bounded costs, and without hallucinating its way through your infrastructure - is an entirely different challenge. This post maps the journey from ad hoc experimentation to trusted production AI, covering context engineering, memory systems, governance, and how to earn autonomy incrementally.

Main insights

  • Production AI requires more than model selection - context engineering, memory, and governance are the hidden challenges that determine success at scale
  • The path to production AI follows five clear steps: choose the right harness, pick a bounded problem, engineer context, build memory, and earn autonomy gradually
  • Context engineering is the unlock for reliable production AI - relevance beats relatedness, and structure beats volume when managing finite context windows
  • Start with reversible, bounded use cases like incident investigation before moving to autonomous remediation

Andre Elizondo, Head of Innovation & AI at Mezmo, brings over a decade of experience across security and infrastructure at companies like Wiz, Adobe, and Chef. Starting his career as a sysadmin, he developed a deep passion for cloud and SRE that now drives his work building production AI systems.

You can watch the full discussion here if you missed it: The journey to production AI

The iceberg problem: What's hiding beneath the surface

Most teams focus on the visible parts of building AI agents - choosing a model, selecting a framework, crafting prompts, and defining use cases. But these represent just the tip of the iceberg. The real challenges are below the waterline.

"What's typically missing is really everything down below," Andre explains. "How am I going to actually make sure that I'm not blowing out the context window? How am I actually going to make sure that the agent actually runs in a way that it gets smarter over time, it gets better over time?"

The hidden challenges include:

  • Context management - Preventing context window bloat while preserving relevant signal
  • Memory and learning - Ensuring agents improve over time rather than starting from scratch on each task
  • Governance and observability - Establishing full audit trails and understanding what agents are doing
  • Cost control - Managing token utilization to prevent runaway expenses
  • Versioning and GitOps - Tracking workflow changes and maintaining discipline at scale
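
Cost control is the easiest of these to make concrete. The following is a minimal sketch, assuming a hypothetical per-run token budget; the `TokenBudget` class, the `BudgetExceeded` error, and the limit are illustrative names and values, not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when an agent run would spend more tokens than allowed."""


@dataclass
class TokenBudget:
    """Hypothetical per-run token budget (an assumption, not a real harness API)."""
    limit: int
    used: int = 0

    def charge(self, tokens: int) -> int:
        """Record token usage and return what remains; refuse to overspend."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"run would use {self.used + tokens} tokens, limit is {self.limit}"
            )
        self.used += tokens
        return self.limit - self.used


budget = TokenBudget(limit=10_000)
remaining = budget.charge(4_000)  # charge the budget before each model call
```

Failing loudly when the budget is exhausted is a design choice here: a run that stops early is cheaper to debug than one that quietly keeps spending.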

According to Gartner research, fewer than 5% of organizations were running SRE tasks with agents in 2024, but that number is projected to reach 85% by 2029. The biggest barriers are precisely these below-the-surface challenges that emerge once you move past the demo phase.

Defining production AI: In production and for production

Andre frames production AI across two critical dimensions. First, the AI system itself must be production-grade - trusted, repeatable, and observable. Second, it must be built for production - meaning it can safely handle the sensitivity and stakes of live environments.

Production-grade systems rest on three core tenets:

  • Trusted - Full audit trails, measurable outcomes, and the ability to grade agent performance over time
  • Repeatable - Consistent workflows that deliver the same outcome every time, defined through simple configuration rather than custom code per agent
  • Observable - End-to-end visibility into agent actions, planning cycles, self-evaluation, and tool calls
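
One way to picture the audit-trail and observability tenets is to wrap every tool an agent can call so that each invocation leaves a record. This is a rough sketch under stated assumptions: `audited`, `AUDIT_LOG`, and `get_error_rate` are hypothetical names, and the in-memory list stands in for durable, append-only storage.

```python
import functools
import json
import time
from typing import Any, Callable

AUDIT_LOG: list[str] = []  # stand-in for durable, append-only storage


def audited(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a tool so each call is recorded with arguments, outcome, and duration."""

    @functools.wraps(tool)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        entry: dict[str, Any] = {"tool": tool.__name__, "args": args, "kwargs": kwargs}
        start = time.monotonic()
        try:
            result = tool(*args, **kwargs)
            entry["status"] = "ok"
            return result
        except Exception as exc:
            entry["status"] = "error"
            entry["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so no call goes unrecorded.
            entry["duration_s"] = round(time.monotonic() - start, 6)
            AUDIT_LOG.append(json.dumps(entry, default=str))

    return wrapper


@audited
def get_error_rate(service: str) -> float:
    """Hypothetical tool; a real one would query a telemetry backend."""
    return 0.02


get_error_rate("checkout")
```

Each entry in `AUDIT_LOG` is a JSON line, which keeps the trail easy to ship to whatever log pipeline is already in place.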

The "for production" dimension matters because the failure modes are asymmetric. When an AI-generated code suggestion fails in development, you regenerate it. When an agent makes the wrong call in production infrastructure, the consequences can cascade quickly.

The five-step journey to production AI

Step 1: Choose the right harness

Rather than starting with general-purpose frameworks, select an opinionated harness built specifically for production operations. "The opinions of how that agent should operate typically live within the harness," Andre explains. This eliminates the need to rebuild boilerplate for every agent and ensures production best practices are built in from day one.

Step 2: Pick a painful, bounded, reversible problem

Your first use case must meet three criteria:

  • Painful - Worth solving and run frequently enough to generate feedback
  • Bounded - Clear inputs, outputs, and ideally an existing runbook
  • Reversible - Mistakes can be corrected without catastrophic consequences

Incident investigation fits all three. "Choose a bounded problem, right, like something where I have very clear inputs, very clear outputs, ideally something that I've already defined in a runbook in my environment," Andre recommends. This gives you a tight feedback loop without exposing production systems to unchecked autonomous action.
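
The three criteria can be turned into a simple screening check for candidate use cases. This is only a sketch: the `UseCase` fields are assumed proxies for "painful", "bounded", and "reversible", and the threshold of three runs per week is an arbitrary illustrative value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    runs_per_week: int  # proxy for "painful": frequent enough to generate feedback
    has_runbook: bool   # proxy for "bounded": inputs and outputs are already defined
    reversible: bool    # mistakes can be corrected without catastrophic consequences


def is_good_first_use_case(uc: UseCase, min_runs_per_week: int = 3) -> bool:
    """Return True only when all three criteria hold."""
    return uc.runs_per_week >= min_runs_per_week and uc.has_runbook and uc.reversible


investigation = UseCase("incident investigation", runs_per_week=10,
                        has_runbook=True, reversible=True)
remediation = UseCase("autonomous remediation", runs_per_week=10,
                      has_runbook=True, reversible=False)

candidates = [uc.name for uc in (investigation, remediation)
              if is_good_first_use_case(uc)]
```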

Step 3: Engineer your context

Context engineering - the practice of deliberately curating what information enters an agent's context window - is the unlock for reliable production AI. Three principles guide this work:

  • Treat context windows as finite resources - Every token introduces latency and cost
  • Prioritize relevance over relatedness - Focus on what changed in the checkout service in the last hour, not everything in your environment
  • Structure beats volume - High-quality, curated signal outperforms flooding the agent with raw data
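
These principles can be sketched as a selection step that runs before anything reaches the agent. The example below assumes each candidate item already carries a relevance score and a token count from upstream components; `ContextItem`, `select_context`, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    text: str
    relevance: float  # 0..1, assumed to come from an upstream scorer
    tokens: int       # assumed precomputed with the model's tokenizer


def select_context(items: list[ContextItem], budget: int,
                   min_relevance: float = 0.5) -> list[ContextItem]:
    """Keep the most relevant items that fit within the token budget."""
    chosen: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        if item.relevance < min_relevance:
            break  # items are sorted, so everything after this is below the threshold
        if used + item.tokens <= budget:
            chosen.append(item)
            used += item.tokens
    return chosen


candidates = [
    ContextItem("checkout deploy 42 minutes ago", 0.9, 30),
    ContextItem("checkout p99 latency spike", 0.8, 40),
    ContextItem("full checkout log dump", 0.6, 500),
    ContextItem("unrelated billing cron history", 0.2, 20),
]
selected = select_context(candidates, budget=100)
```

With a budget of 100 tokens, the two small, high-relevance items are kept, the large log dump is skipped because it does not fit, and the merely related billing history is dropped by the threshold.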

"It's a lot more meaningful to have relevance versus relatedness," Andre emphasizes. "I actually really would rather have the most relevant things based on what that agent is currently operating on."

Step 4: Build memory that compounds

Agents must get smarter over time, not reset on every task. This requires engineering memory systems that:

  • Persist learnings across investigations
  • Understand service relationships and typical patterns
  • Store and prioritize relevant historical context
  • Filter noise before it reaches the agent
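
As a rough illustration of memory that persists across investigations, the sketch below stores findings per service in SQLite and recalls confirmed findings first. `InvestigationMemory` and its schema are assumptions made for this example, not how any particular harness implements memory.

```python
import sqlite3


class InvestigationMemory:
    """Hypothetical store of investigation findings, keyed by service."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to persist across runs.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS findings ("
            "id INTEGER PRIMARY KEY, service TEXT NOT NULL, "
            "finding TEXT NOT NULL, confirmed INTEGER NOT NULL DEFAULT 0)"
        )

    def remember(self, service: str, finding: str, confirmed: bool = False) -> None:
        """Persist one finding from an investigation."""
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO findings (service, finding, confirmed) VALUES (?, ?, ?)",
                (service, finding, int(confirmed)),
            )

    def recall(self, service: str, limit: int = 5) -> list[str]:
        """Return findings for one service: confirmed first, then most recent."""
        rows = self.db.execute(
            "SELECT finding FROM findings WHERE service = ? "
            "ORDER BY confirmed DESC, id DESC LIMIT ?",
            (service, limit),
        )
        return [row[0] for row in rows]


memory = InvestigationMemory()
memory.remember("checkout", "503s correlated with connection pool exhaustion",
                confirmed=True)
memory.remember("checkout", "suspected DNS flap, not confirmed")
memory.remember("billing", "nightly cron causes CPU spike")
recalled = memory.recall("checkout")
```

Filtering by service and capping the result with `limit` keeps recalled memory from becoming another source of context bloat.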

"You actually want to kind of like focus on persisting memory for the agent so that the agent can better understand its environment over time," Andre notes. Without this, you're paying the full cost of context reconstruction on every run.

Step 5: Earn autonomy gradually

Teams can progress through three levels of automation:

  • Co-pilot - The agent suggests actions; a human approves everything
  • Assistant - The agent handles recognized patterns autonomously and escalates on novel scenarios
  • Autonomous - The agent operates independently with full audit trails for compliance
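
The three levels can be read as a policy gate in front of every proposed action. Here is a minimal sketch, assuming the agent can report whether it recognized the pattern it is acting on; `AutonomyLevel`, `Decision`, and `decide` are hypothetical names.

```python
from enum import Enum


class AutonomyLevel(Enum):
    CO_PILOT = 1
    ASSISTANT = 2
    AUTONOMOUS = 3


class Decision(Enum):
    ASK_HUMAN = "ask_human"
    EXECUTE = "execute"


def decide(level: AutonomyLevel, pattern_recognized: bool) -> Decision:
    """Map an autonomy level and pattern recognition to an action decision."""
    if level is AutonomyLevel.CO_PILOT:
        return Decision.ASK_HUMAN  # a human approves everything
    if level is AutonomyLevel.ASSISTANT:
        # Recognized patterns run autonomously; novel scenarios escalate.
        return Decision.EXECUTE if pattern_recognized else Decision.ASK_HUMAN
    return Decision.EXECUTE  # autonomous: act, relying on the audit trail for review


novel_case = decide(AutonomyLevel.ASSISTANT, pattern_recognized=False)
```

Keeping the level as explicit configuration means trust can be raised one workflow at a time rather than for the whole system at once.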

"We want to be able to get to the point where the agents that we're running in production are operating in a way that we trust them," Andre explains. "We get kind of more into the audit mode of I know I can visualize everything." Earning that trust incrementally is what separates sustainable production AI from brittle demos.

Real-world impact: From months to minutes

The proof is in production deployments. Rescale, using Mezmo's AI SRE functionality with the Aura harness, reduced investigation time from months to under an hour for complex incident response cases - including gnarly transient 503 errors that typically require extensive manual investigation across distributed systems.

"We take investigations from the matter of months to just a few minutes," Andre reports. That kind of operational leverage is only achievable when context, memory, and governance are treated as first-class engineering concerns.

Open source and community-driven development

Mezmo has open-sourced Aura, an agentic harness purpose-built for SRE and platform teams. The harness eliminates approximately 80% of agent-building boilerplate through simple TOML configuration files and makes agents observable and governed by default.

Aura supports model-agnostic deployment, working with OpenAI, Anthropic, Ollama, Bedrock, and any service exposing an OpenAI-compatible endpoint. The open-source approach enables community contribution and standardization - teams can adopt proven workflows from other SREs without rebuilding the underlying orchestration for memory, tool calls, and audit trails.

If you enjoyed this, you can find more insights and events from our Platform Engineering Community.

If you want to dive deeper, explore our instructor-led Platform Engineering Certified Professional course and connect with peers from large-scale enterprises who are driving platform engineering initiatives.

Key takeaways

  • Production AI requires engineering beyond the model - Context management, memory, governance, and observability are not optional extras. Teams that focus only on model selection and prompts will hit hard walls when moving from demo to production.
  • Start small and bounded, then graduate trust - Begin with reversible, well-defined use cases like incident investigation rather than jumping straight to autonomous remediation. Build confidence through measurable outcomes before expanding agent autonomy.
  • Context engineering is a first-class production concern - Finite context windows demand careful curation. Prioritize relevance over relatedness, structure over volume, and treat every token as a cost and latency consideration.
  • Choose opinionated harnesses over general frameworks - Production-focused harnesses encode best practices and eliminate boilerplate, making it easier to scale from one agent to ten or a hundred while maintaining consistency, observability, and governance across your agentic systems.