Scaling from 2K engineers to 12K engineers and 10K deploys per day
What do you do when your organizational maturation follows an almost classical exponential pattern?
It may sound like you've achieved the dream, but there's a catch—or several of them. From meeting compliance requirements to choosing the ideal product capacities and efficient CI/CD workflows, light-speed growth can be as much of a curse as a blessing.
But why let the hurdles trip you up? With the right Internal Developer Platform (IDP) strategy, you can scale and succeed, even if you're going from 2,000 to 12,000 engineers and 26,000 microservices. In this webinar, Lucia Brizuela and Juliano Martins shared how they achieved that at Mercado Libre and offered lessons to inform your platform journey.
Choosing an Internal Developer Platform: Investing in the right tooling at the right time
Some life lessons are more powerful in retrospect. The Mercado Libre growth story couldn't be a better example.
Today, Mercado Libre is Latin America's largest eCommerce and payment ecosystem. Its modern microservices architecture handles 50 million requests per second and 10,000 deploys every day.
How does this all get done? Scale plays a big role. Mercado Libre relies on three cloud providers and on-premises architecture, trains more than 700 machine learning models daily, and maintains a library of some 26,000 microservices!
At the turn of the century, however, things were a lot different, and not just because the cloud was still a fresh concept to most. Back then, Mercado Libre was already something of a big player, but it was getting the job done with a Java monolith.
Why create a platform?
So what changed? As Juliano told it, success was a double-edged sword. Mercado Libre was growing, and by 2010 the company faced several problems that had outpaced the original monolith.
Mercado Libre hadn't quite lost its way, but it had definitely lost its time to market. Deployments and errors went hand in hand. Rollbacks were a constant fact of life.
Many issues defied debugging efforts due to sheer complexity. Eventually, the company's devs found themselves with little choice. They had to break down the monolith to make it more manageable.
This worked for a while; the team could chip away at systems until they reached the ideal microservice size. They even gained some agility and reduced time to market in the process.
The good times wouldn't last, however. More microservices meant more cognitive load: The knowledge required to accomplish critical tasks was increasing.
Governance, compliance, and security also took hits. It was impossible to tell at a glance whether devs were using best practices or applying updates properly.
Perhaps worst of all, front-line teams had to bear the brunt of the burdens. Having to master an army of technologies and struggling with inconsistent practices doesn't exactly make for a low-friction, rewarding developer experience (DevEx).
The platform solution
Creating better-organized, more cohesive tooling was the clear answer. As Lucia explained, things kicked off in 2015 when the team started building an internal platform that would come to be known as Fury.
From the outset, the Fury project had three major objectives: Saving money, enhancing governance and control, and improving security. The platform needed to answer any possible developer need with solutions that fostered consistent overall experiences. It also needed to reduce complexity and decrease time to market by providing reusable components and repeatable workflows.
That all sounds great in theory. Unfortunately, the team faced a problem that plagues most organizations that want to adopt internal platforms: Getting developers to abandon their old day-to-day for a fresh new look.
The how: Selling the platform
Lucia's team solved the problem by focusing on a few hallmarks of any good platform:
- Ease of use: Eliminating hurdles helped people start using the platform right away.
- Crafted user experiences: The platform team set its sights on a highly focused, purposeful DevEx. They ensured devs could concentrate on product design, not configuration, infrastructure, or maintenance.
- Fluid, problem-free rollouts: The platform engineers started with the most critical microservices and worked to prove they ran smoothly. This made it easier to justify migrating other workloads.
Concentrating on these goals didn't just result in a great platform with clear benefits. It also helped change developers' mindsets to earn buy-in.
This didn't happen overnight, but catering to the users proved worthwhile. By promoting voluntary adoption, the organizational culture naturally followed.
Divvying up IT talent for scalable efficiency
Your platform's technology is just part of the story, and it can work against you. It's all too easy to fall in love with a specific tool, even if it's not the ideal fit. You need to organize your teams so that developers don't fall prey to biases or develop unproductive habits.
For Mercado Libre, this philosophy can be summed up in a succinct motto: The organization distributes teams into products and capabilities. The platform and the services it provides occupy the lowest layers of this hierarchy.
How to do abstraction right
Juliano offered the example of a financial services product that determined a user's credit rating. The end-user product builds on machine learning and risk-scoring capabilities to determine whether to offer credit, prevent fraud, and perform other tasks. These capabilities are discrete units that run on top of the platform—and the developers can reuse them as needed when it's time to create new products.
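To make the layering concrete, here is a minimal sketch of one capability reused by two products. The function names, the toy scoring formula, and the thresholds are all assumptions made for illustration; the talk doesn't describe how Mercado Libre's risk scoring actually works.

```python
# Hypothetical sketch: one shared capability, two products built on top of it.

def risk_score(monthly_income, monthly_debt):
    """Capability: toy risk score in [0, 1]; higher means riskier."""
    if monthly_income <= 0:
        return 1.0
    return min(1.0, monthly_debt / monthly_income)


def offer_credit(monthly_income, monthly_debt, threshold=0.4):
    """Product 1: offer credit only when the shared score is below a threshold."""
    return risk_score(monthly_income, monthly_debt) < threshold


def needs_manual_review(monthly_income, monthly_debt, threshold=0.9):
    """Product 2: reuse the same capability to route risky cases to a human."""
    return risk_score(monthly_income, monthly_debt) >= threshold
```

Because both products call the same `risk_score` capability, a fix or improvement to it reaches every product that depends on it.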
There's another huge advantage to building a platform with clearly delineated capabilities separate from product concerns: standardized workflows. Mercado Libre's developers can go from zero to production in four painless, consistent steps:
1. Creating an application via the UI: The developer chooses languages and other framework elements from a subset of preapproved options. After just a few minutes, they have a fully scaffolded project repo ready to clone and start working with.
2. Creating a web scope: The developer selects autoscaling, infrastructure, traffic management, logging, and other options, with the platform providing suggestions on what might work best.
3. Opening a pull request: Upon opening a pull request, the developer triggers automatic quality checks, validation, linting, and metrics assessments, all of which happen before a version gets created.
4. Deploying a version: Fury lets developers deploy using their choice of strategy on any infrastructure scope and monitor the code running in the appropriate environments. Mercado Libre can then centralize business practices like auditing, rollbacks, and security changes.
This four-step process isn't just easy to follow. It also lets the developers take useful abstractions for granted. Instead of creating pipelines and repos or allocating compute resources, users can do what matters most: solving business problems and creating new products.
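The four steps can be modeled as a small sketch to show how each one gates the next. This is not Fury's API: every name here (`create_application`, `create_scope`, `open_pull_request`, `deploy`), the language allow-list, and the default deployment strategy are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

# Hypothetical allow-list; the real preapproved options aren't given in the talk.
PREAPPROVED_LANGUAGES = {"go", "java", "python"}


@dataclass
class Application:
    name: str
    language: str
    scopes: dict = field(default_factory=dict)
    versions: list = field(default_factory=list)


def create_application(name, language):
    """Step 1: scaffold an app, accepting only a preapproved language."""
    if language not in PREAPPROVED_LANGUAGES:
        raise ValueError(f"{language!r} is not a preapproved language")
    return Application(name=name, language=language)


def create_scope(app, scope_name, min_instances=2, max_instances=10):
    """Step 2: attach an infrastructure scope with autoscaling bounds."""
    if not 1 <= min_instances <= max_instances:
        raise ValueError("autoscaling bounds must satisfy 1 <= min <= max")
    app.scopes[scope_name] = {"min": min_instances, "max": max_instances}


def open_pull_request(app, version, checks):
    """Step 3: run every check; create the version only if all of them pass."""
    failed = [name for name, check in checks.items() if not check()]
    if not failed:
        app.versions.append(version)
    return failed


def deploy(app, version, scope_name, strategy="blue-green"):
    """Step 4: deploy an existing version to an existing scope."""
    if version not in app.versions:
        raise ValueError(f"version {version!r} was never created")
    if scope_name not in app.scopes:
        raise ValueError(f"unknown scope {scope_name!r}")
    return {"app": app.name, "version": version,
            "scope": scope_name, "strategy": strategy}


# Usage: walk one app through all four steps.
app = create_application("credit-scoring", "go")
create_scope(app, "production")
failed = open_pull_request(app, "1.0.0", {"lint": lambda: True, "tests": lambda: True})
record = deploy(app, "1.0.0", "production")
```

The point of the sketch is the ordering: a version exists only after its checks pass, and a deploy is accepted only for a version and scope the platform already knows about, which is what makes centralized auditing and rollbacks possible.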
Building a platform that does the job(s)
Behind the curtain, Fury is more than just a bunch of apps and cloud services. It also includes a UI, web interface, and CLI for developers. The platform engineering team has a virtual "back office" that lets them work on the platform. SDKs and APIs make it easy to build applications that interact with various abstracted services.
Being able to scaffold native mobile applications, APIs, machine learning models, and other microservices in just a couple of clicks is a promising start for any platform. But it's just as important to facilitate effective governance that optimizes the DevEx.
Fury achieves this by providing users with feature-rich summary dashboards. Devs can easily see how much they're spending (or wasting!) on infrastructure and get cost estimates of different technology strategies.
The platform also makes automated recommendations for improving performance and lets users apply these changes without impacting uptime.
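As a rough illustration of the arithmetic such a dashboard might surface, the sketch below estimates monthly spend, the share wasted on idle capacity, and the cheaper of two candidate strategies. The formulas, the 730-hour month, and the example rates are assumptions, not figures or methods from Fury.

```python
HOURS_PER_MONTH = 730  # assumed average number of hours in a month


def monthly_cost(instances, hourly_rate):
    """Estimated monthly spend for a fleet of identical instances."""
    return instances * hourly_rate * HOURS_PER_MONTH


def estimated_waste(instances, hourly_rate, avg_utilization):
    """Spend attributable to idle capacity, given average utilization in [0, 1]."""
    if not 0.0 <= avg_utilization <= 1.0:
        raise ValueError("avg_utilization must be between 0 and 1")
    return monthly_cost(instances, hourly_rate) * (1.0 - avg_utilization)


def cheapest_strategy(strategies):
    """Return the name of the strategy with the lowest estimated monthly cost.

    `strategies` maps a name to an (instances, hourly_rate) tuple.
    """
    return min(strategies, key=lambda name: monthly_cost(*strategies[name]))


# Usage with made-up numbers.
strategies = {"many-small": (10, 0.05), "few-large": (2, 0.40)}
best = cheapest_strategy(strategies)
waste = estimated_waste(10, 0.05, avg_utilization=0.5)
```

A real dashboard would draw on billing and utilization data rather than a flat hourly rate, but even this simple comparison shows why surfacing waste next to spend helps developers choose between strategies.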
The Mercado Libre team learned quite a lot from building their platform. Here are three highlights particularly worth remembering:
1. Controlled freedom has value
Organizations that use platforms must accept that there are trade-offs between developer freedom and control. Striking a balance between flexibility and centralized control will boost adoption, but there's no universal rule of thumb.
So what's the answer? Stop looking for a perfect catch-all metric from day one. Instead, be open to your performance measures hardening organically as your platform evolves with time.
Be sure you instil in your developers the idea that A4 paper doesn't stifle creativity. Compliance and standardization are prerequisites for teamwork.
2. Not everything is beautiful
Platforms solve many problems, but they aren't all glamorous. You have to provide support and training. It's the best way to overcome adoption hurdles and gain acceptance without outright imposing tooling on developers.
Treat your developers like you would treat any other users. Don't stop at forming support teams to handle customer tickets. Also, assign technical account managers and subject matter experts who'll have your devs' backs when they're struggling in the trenches.
Remember: You should help your devs understand your platform and implement appropriate solutions. Invest in Knowledge Management teams to guide them through onboarding and training.
The platform components, workflows, and functions that seem obvious to you won't always feel so intuitive to users. Give yourself the leeway needed to handle tickets gracefully, especially when you're releasing new features.
3. Your platform won't meet all cases
There's no universal answer to every development use case. This can represent an organizational risk, but there's a solution: choose appropriate governance models for outlier situations.
Developers are creative; that's why you hire them! Be ready for what might happen when they experiment and push the boundaries of your platform.
Many problems occur in nonstandard edge cases, like sandboxes, proofs of concept, and unproven third-party integrations. Use infrastructure as code, training, and support to enforce as much structure as you can.
Fury is just one example of how a well-planned, proactively managed internal platform can help software architects keep up with demand. There are many ways to sustain organizational growth. Building a capable platform is one of the best options for striking a balance between flexibility and standardization.
There was a lot more in this talk, including how Mercado Libre used Fury to survive the Log4j vulnerability and keep on trucking at scale. It's a must-see for anyone considering building a platform, so watch the full webinar for the details.