Zalando is Europe's leading e-commerce fashion platform, with 48 million active customers and 5,800 brands. Naturally, that kind of business volume comes with a massive engineering responsibility. So how did the company create an internal platform for a force of more than 1,000 engineers?
You'll find the answer in this Humanitec webinar with Henning Jacobs. As Zalando's Executive Principal Engineer, Henning Jacobs sheds light on what effective software development looks like at a successful international company. We've extracted the important details here if you can't watch the webinar.
Why build a cloud native Internal Developer Platform?
With employees from over 100 nations and thousands of developers, Zalando maintains a massive ecosystem.
How does such a big firm make its tooling more manageable without decreasing devs’ sense of ownership? For starters, the company decided to build an Internal Developer Platform (IDP) that played off of the "you build it, you run it" concept.
Of course, Henning’s team needed more than just a catchy philosophy. It was equally vital to ensure that devs enjoyed a common experience. Any platform the company created ultimately needed to serve as a viable daily starting point regardless of who was using it.
The ideal platform also needed to be definable and extendable via open-source tools – Zalando chose Kubernetes manifests and CloudFormation files as accessible stepping stones. Another major goal was to promote compliance by default, something very important for companies that serve customers in the EU.
Finally, Zalando wanted to improve how it operated over time. Quantifying software delivery performance and the happiness of its builders – including engineers, devs, and data scientists – seemed like perfectly logical choices.
The cloud native answer to the platform problem
To achieve these ends, Zalando built an internal platform known as Sunrise. Its main dashboard includes a quick view of tasks, performance indicators, and other operational vital signs. In short, it's a one-stop solution designed to help devs pick up unfinished work right where they left off.
Sunrise also aggregates common data points in one place and makes it easy to search for specific applications and APIs, identify owners, and launch support tickets. The platform integrates everything from application event publication and subscription to documentation.
Engineers and other builders can check out product portfolios to monitor services live, collaborate with teams and individuals, and troubleshoot CI/CD pipelines. Zalando team members can even bootstrap and deploy new apps using Sunrise.
Building Sunrise
This platform didn't just come together haphazardly: Zalando's mission was to integrate the entire builder experience into one tool while leaving room for expansion. It chose to accomplish this by building Sunrise on top of Spotify's Backstage.
Migrating from its previous internal system let Zalando leverage existing tooling. It also gave the company a chance to start fresh by building a truly multi-sided platform that anyone could contribute to.
At the same time, Backstage had its drawbacks: it was fast-moving enough that things broke frequently. Also, many of the plugins the company wanted to use needed customization to work properly.
Henning stresses that the architecture aimed to overcome these negatives while making life easier for humans. For instance, automating the pull request system reduced the workload and made configurations more manageable. From there, the configs determined how users interacted with AWS and other services.
Keeping things organized pays off because expansive architectures have large footprints. Sunrise is no exception: There are 90 production/non-production cluster pairs – each with dedicated AWS accounts and email groups. To manage it all, Zalando configures each cluster according to loosely designated product communities with their own managers and ownership.
One interesting pitfall you can learn from is that the devs initially tried using team namespaces. Henning now recommends against following in their footsteps. If you do, you'll need to migrate resources every time ownership changes!
Making Kubernetes your own
Want to build your own cloud-native dev platform? According to Henning, you'll probably want to get used to customizing the cloud.
Zalando's devs weren't afraid to modify how K8s worked to solve problems. For instance, Sunrise's custom kubectl wrapper includes special commands for requesting and approving emergency cluster access. This seemingly simple feature easily overcomes a huge barrier to responsive troubleshooting.
Access commands weren't the only functions Zalando added: Its stackset controller also let the team implement gradual traffic switching. To create it, the team built a custom K8s config that let it choose an appropriate tool – in this case, Skipper – to shift traffic between versions step by step.
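To make the idea of gradual traffic switching more concrete, here is a minimal Python sketch. It is not how the stackset controller or Skipper is implemented; the function names `shift_traffic` and `rollout`, the percentage-based weights, and the fixed step size are all assumptions made for illustration.

```python
from typing import Dict, Iterator


def shift_traffic(weights: Dict[str, float], target: str, step: float) -> Dict[str, float]:
    """Move up to `step` percentage points of traffic toward `target`.

    Hypothetical helper: the remaining traffic is split across the other
    versions in proportion to their current share, so weights sum to 100.
    """
    if target not in weights:
        raise KeyError(f"unknown version: {target}")
    current = weights[target]
    new_target = min(100.0, current + step)
    others_total = 100.0 - current      # traffic currently served by other versions
    remaining = 100.0 - new_target      # traffic left for them after this step
    result = {}
    for name, weight in weights.items():
        if name == target:
            result[name] = new_target
        elif others_total > 0:
            result[name] = weight * remaining / others_total
        else:
            result[name] = 0.0
    return result


def rollout(weights: Dict[str, float], target: str, step: float) -> Iterator[Dict[str, float]]:
    """Yield successive weight maps until `target` receives all traffic."""
    while weights[target] < 100.0:
        weights = shift_traffic(weights, target, step)
        yield weights


if __name__ == "__main__":
    # Shift traffic from v1 to v2 in 25-point steps.
    for stage in rollout({"v1": 100.0, "v2": 0.0}, target="v2", step=25.0):
        print(stage)
```

In a real rollout, each step would typically wait for health signals before continuing, which is the main benefit of switching gradually instead of all at once.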
Measuring experience
To make Sunrise as useful as possible, Zalando needed to understand precisely how the platform impacted the typical dev experience. It focused on the four well-known delivery performance metrics identified by the popular "Accelerate: The Science of Lean Software and DevOps" book: lead time, deployment frequency, change failure rate, and time to restore service. At the same time, it added a unique twist, adapting these metrics to suit its practices.
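As a rough illustration of what tracking these metrics involves, the sketch below computes the four standard "Accelerate" measures from a list of deployment records. It does not reflect Zalando's adapted definitions, which the webinar summary does not spell out; the `Deployment` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # whether it caused a production failure
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def delivery_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four delivery performance metrics over a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    sample = [
        Deployment(start, start + timedelta(hours=1)),
        Deployment(start, start + timedelta(hours=3), failed=True,
                   restored_at=start + timedelta(hours=3, minutes=30)),
        Deployment(start, start + timedelta(hours=2)),
    ]
    print(delivery_metrics(sample, period_days=1))
```

Medians are used here rather than means so that a single slow change does not dominate the picture; that choice is also an assumption, not something stated in the talk.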
Sunrise integrates with Lightstep OpenTracing to provide a range of functionalities. For instance, it can analyze service dependencies to create visual graphs of interactions. It also uses adaptive paging to identify the root causes of problems and notify the correct stakeholders.
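The dependency analysis can be pictured with a small sketch: given trace spans that record which span called which, you can derive a service-to-service graph. The `Span` type below is a simplified, hypothetical stand-in for real tracing data, not Lightstep's or OpenTracing's actual data model, and Sunrise's own analysis is not described in enough detail to reproduce here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class Span:
    """Simplified, hypothetical trace span."""
    span_id: str
    parent_id: Optional[str]   # None for the root span of a trace
    service: str               # name of the service that emitted the span


def service_dependencies(spans: Iterable[Span]) -> Dict[str, Set[str]]:
    """Map each calling service to the set of services it calls."""
    spans = list(spans)
    service_by_span = {s.span_id: s.service for s in spans}
    graph: Dict[str, Set[str]] = defaultdict(set)
    for span in spans:
        parent_service = service_by_span.get(span.parent_id)
        # Skip root spans and calls that stay inside the same service.
        if parent_service is not None and parent_service != span.service:
            graph[parent_service].add(span.service)
    return dict(graph)


if __name__ == "__main__":
    trace = [
        Span("a", None, "shop"),
        Span("b", "a", "cart"),
        Span("c", "b", "payment"),
    ]
    print(service_dependencies(trace))
```

The resulting mapping is the kind of structure a visual graph of interactions could be drawn from.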
How do you achieve compliance?
Compliance is of vital importance to an organization of Zalando's size. Henning describes how the company depends on a multi-tiered system that achieves effective governance using:
- An asset registry that stores vital data about applications, including ownership, status, and custom alerts,
- Mandatory 4-eye PR approval,
- Approved Docker base images, and
- Mandatory k8s labeling and AWS tagging.
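Checks like the ones in the list above lend themselves to automation. The following sketch shows one way such rules could be expressed; the required label keys, the registry prefix, and the function name are all hypothetical and are not taken from Zalando's actual setup.

```python
from typing import Dict, List

# Assumed values for illustration only.
REQUIRED_LABELS = {"application", "team"}
APPROVED_BASE_IMAGE_PREFIXES = ("registry.example.org/base/",)


def compliance_violations(
    manifest: Dict, base_image: str, author: str, approvers: List[str]
) -> List[str]:
    """Return a list of human-readable violations (empty if compliant)."""
    violations = []

    # Mandatory labeling: every required key must be present on the resource.
    labels = manifest.get("metadata", {}).get("labels", {}) or {}
    missing = sorted(REQUIRED_LABELS - set(labels))
    if missing:
        violations.append(f"missing labels: {', '.join(missing)}")

    # Approved base images: the image must come from an approved prefix.
    if not base_image.startswith(APPROVED_BASE_IMAGE_PREFIXES):
        violations.append(f"unapproved base image: {base_image}")

    # 4-eye principle: at least one approver other than the author.
    if not any(a != author for a in approvers):
        violations.append("pull request lacks approval from a second person")

    return violations


if __name__ == "__main__":
    manifest = {"metadata": {"labels": {"application": "checkout"}}}
    print(compliance_violations(manifest, "docker.io/library/python:3.12",
                                author="alice", approvers=["alice"]))
```

Running checks like these before merge is one way to get "compliance by default": developers see violations early instead of during an audit.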
The wrap-up: what Sunrise achieved
Sunrise is a huge platform featuring thousands of services. So does Henning think it was worth the effort? Here's why the answer is a resounding yes:
Keeping everyone on track
The Sunrise developer platform subtly unifies the dev experience via its unique combination of tools, workflows, and a slick dashboard environment. It provides more explicit nudging in the form of a monthly builder newsletter that collects information from all teams. It also keeps people updated on new Sunrise features.
Working for the organization
During his talk, Henning repeatedly returned to the core assertion that Sunrise must support the needs of different audiences. For instance, different stakeholders proved more likely to use certain dashboards, but the system needed to accommodate everyone. To make this more viable, Zalando integrated a Sunrise primer into its standard onboarding program, giving new devs the support they need to master fundamental tasks.
Offering flexibility
Henning also elaborated on the tradeoffs between maintaining team autonomy and centralizing services. In the end, he said that Zalando valued established processes over local optimizations: Local solutions aren't always the best options for the company as a whole.
Teams that don't want to abandon their favorite hacks may resist switching to centralized services. When it comes to scaling at the corporate level, however, organization-wide solutions are far more effective. At the same time, centralized services can still accommodate the need to make changes, such as when you're working with vendor tools that might be opinionated.
Choosing where to invest
Henning also divulged a fascinating insight into how business decisions relate to internal platform development. Instead of using quotas to determine the size of its platform engineering team, Zalando takes a flexible approach that concentrates on the projects and concepts it deems investment-worthy. Considering what Sunrise has achieved with only a fraction of the company's total engineering workforce, this philosophy seems to have paid off.
A good example involves Zalando's dedicated data engineers. Sunrise accommodates these devs with a machine learning platform that consumes pipelines defined as Python scripts. Henning says that creating a flexible architecture made it easier to add these kinds of capabilities.
Sunrise's design mentality focuses on productivity and developer happiness, which lets Zalando keep pursuing new opportunities to improve both. These kinds of metrics can be notoriously hard to optimize for, however. Maintaining a flexible platform makes it possible to zero in on a team-by-team basis – and refine what you track to generate more useful data.