As cloud complexity rises, companies need a coherent abstraction to reduce cognitive overhead, so that engineering can focus its attention on building and delivering product value. Most companies choose to build this abstraction via an internal platform team.
Unfortunately, not all platform stories are a success. For example, refer to this Reddit thread. In an all-too-familiar story, the author’s company used standard cloud services and tooling (GitHub Actions, EKS, RDS, Terraform, Helm) and was saddled with operational overhead and YAML config hell. The author and their team spent months building a platform to abstract away the complexities, only for their customers, the developers, to reject the end product. What went wrong here? Without knowing the specifics of the situation, it is impossible to know for sure. However, in my experience, when platform engineers do not approach the problem with a specific mindset, they run into a common and major failure: they do not build a platform their customers want.
The goal of platform engineering
To understand the platform mindset, we first need to understand what platform engineering is about. I propose we examine the discipline through the lens of the jobs to be done framework. Customers have a need, aka their “job to be done”. The job could be anything, from transporting the customer from A to B to, as discussed in the article, providing Hershey’s Reese’s in alternative packaging that does not require two hands to unwrap each individual piece. Crucially, for a customer to want a product, the product must solve the job to be done both functionally and emotionally. For example, the product that solved the Reese’s job to be done was Reese’s Mini, which comes bundled in an easily accessible bag and, just as importantly, does not leave a trail of wrappers to remind customers how many they ate.
How does the jobs to be done framework apply to platform engineering? The customers of platform engineers are other engineers, most commonly application engineers. The increasing complexity of the cloud has introduced a great deal of friction into how engineers ship their code. The obvious insight is that application engineers’ job to be done is to ship code without this friction. This idea is supported by several independent sources in the platform community.
- When well executed, a platform strategy promises to reduce costs and allow product development teams to focus on innovation. - Cristóbal García García and Chris Ford from Thoughtworks
- Platform engineering reduces those [operational] costs, removing the obstacles that slow developers down. - Carlos Schults with Liatrio
- Platform engineers provide an integrated product … covering the operational necessities of the entire lifecycle of an application. - Luca Galante from Humanitec
However, as evidenced by the earlier Reddit thread, it is possible to ship a platform that reduces friction only for engineers to not want it. When solving the application engineers’ job to be done, platform engineers must understand their needs both functionally and emotionally. Functionally, the author in the Reddit thread provided a platform that helps application engineers get a handle on the complicated Terraform and YAML configs they need to interface with the rest of the infrastructure. But is a config generator what the application engineers want? Or do they want an end-to-end workflow that abstracts away the complicated infrastructure entirely? To me, it is the latter. A platform that focuses on tooling instead of an end-to-end workflow does not solve the job completely; it replaces one job with a different set of jobs that application engineers still need to get done. The emotional aspect matters too. Do the application engineers want to be known for writing the best, most correct Terraform configs? Or do they want to be known as the fastest engineers at shipping features and getting greens on all their OKRs? Using the lens of the job to be done, I would argue that platform engineers must provide a coherent workflow that lets application engineers focus on writing and shipping code, not running it.
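To make the tooling versus workflow distinction concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `Service`, `generate_config`, `ship`, and the step names are invented for illustration and do not describe the platform from the Reddit thread or any real product. The point is only to show who is left holding the remaining steps in each approach.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical description of what an application engineer wants to ship."""
    name: str
    image: str


def generate_config(service: Service) -> dict:
    """Tooling approach: the platform hands back config and stops there."""
    return {"name": service.name, "image": service.image, "replicas": 2}


def steps_left_with_tooling(service: Service) -> list[str]:
    """The jobs the application engineer still owns after using the generator."""
    config = generate_config(service)
    return [
        f"review generated config for {config['name']}",
        "apply infrastructure changes",
        "deploy the image",
        "set up monitoring and alerts",
    ]


@dataclass
class ShipResult:
    service: str
    completed_steps: list[str] = field(default_factory=list)


def ship(service: Service) -> ShipResult:
    """Workflow approach: one entry point, and the platform owns every step."""
    config = generate_config(service)  # config still exists, but stays internal
    result = ShipResult(service=service.name)
    for step in ("provision", "deploy", "monitor"):
        # A real platform would do actual work here; this only records the step.
        result.completed_steps.append(f"{step}:{config['name']}")
    return result


checkout = Service(name="checkout", image="checkout:1.0")
todo = steps_left_with_tooling(checkout)  # four jobs remain with the engineer
shipped = ship(checkout)                  # no jobs remain with the engineer
```

In the first approach the generated config is the product and the engineer inherits a to-do list. In the second, the same config is an internal detail and the engineer makes a single call.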
Dropbox’s Atlas: a workflows case study
In 2017, Dropbox kicked off its Service Oriented Architecture (SOA) initiative to break apart its monolith. By then, operating the monolith was becoming untenable, with significant resources going into pushing and monitoring, and tech debt piling up. The plan was to invest heavily in the tooling and infrastructure for running services and then incentivise product engineers to break their code out of the monolith with the usual promises of services, i.e. increased isolation, better control, and flexible push cadences.
Fast forward to 2020: Dropbox had built some of the best infrastructure for running services, including a robust RPC framework, a powerful metric and alerting system, a fast build system, and a deeply integrated experimentation system, along with many other systems and frameworks that were not publicly featured. Gone were the days of software teams having to manage physical hosts and deal with hardware failures. Infrastructure teams had been running services for years and were able to ship code quickly. The story for product engineering, unfortunately, was less hopeful. Few product teams were running services, and the ones that did had to spend months, if not years, splitting out their services, only to be saddled with operational responsibilities they were not trained for. The SOA initiative hindered them in performing their jobs rather than enabling them.
What went wrong? The SOA initiative focused entirely on shipping the powerful building blocks necessary to run services, from a technical perspective. It did not take customers’ needs into account and so did not provide a workflow that tied all these building blocks together. That responsibility was left to the customers, the product engineers.
In 2020, I led a new initiative called Atlas to tackle the problem from a different angle. The team set out to be a one-stop shop for all product teams to operate in production. Instead of asking product teams to learn how to run services, we built workflows that abstracted away the complexity of running services. Product engineers wrote code as collections of related endpoints, and we as the platform team were responsible for taking these collections of endpoints and operating them in production. We created isolated clusters for them. We pushed them and notified the owning product teams of any regressions found during pushes. We automatically monitored their reliability and performance and escalated when needed. By the end of the project, the monolith was no more and both the product and infrastructure teams were happy.
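The division of responsibility described above can be sketched in a few lines of Python. This is an illustration only, not Atlas code: `EndpointGroup`, `Platform`, the cluster naming, and the smoke check standing in for regression detection are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EndpointGroup:
    """Product side (hypothetical): a collection of related endpoints and its owner."""
    name: str
    owner: str
    handlers: Dict[str, Callable[[dict], dict]] = field(default_factory=dict)

    def endpoint(self, path: str):
        """Decorator that registers a handler under the given path."""
        def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
            self.handlers[path] = fn
            return fn
        return register


class Platform:
    """Platform side (hypothetical): owns clusters, pushes, and notifications."""

    def __init__(self) -> None:
        self.clusters: Dict[str, str] = {}
        self.notifications: List[str] = []

    def push(self, group: EndpointGroup) -> None:
        # Each group gets its own isolated cluster; the product team never picks one.
        self.clusters[group.name] = f"cluster-{group.name}"
        # Stand-in for regression detection: call every handler with an empty request.
        for path, handler in group.handlers.items():
            try:
                handler({})
            except Exception as exc:
                self.notifications.append(
                    f"{group.owner}: {path} failed smoke check ({exc})"
                )


# A product team only writes endpoints.
files = EndpointGroup(name="files", owner="files-team")


@files.endpoint("/list")
def list_files(request: dict) -> dict:
    return {"files": []}


@files.endpoint("/broken")
def broken(request: dict) -> dict:
    raise ValueError("boom")


# The platform team operates them.
platform = Platform()
platform.push(files)
```

The product team’s code contains no cluster, deployment, or alerting logic. When the platform finds a problem during a push, it routes the notification to the owning team.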
The Atlas team’s focus on workflows did not mean we ignored the tech stack. In fact, because we first aligned on the necessary workflows with our customers, we were able to make improvements to the tech stack and eliminate years of tech debt. We deprecated the legacy web stack in favour of gRPC, bringing internal systems from HTTP/1.0 to HTTP/2. We reduced the internal serving overhead by over 50%. We replaced manual cluster sizing with autoscaling. We partnered with the traffic team to facilitate the Envoy migration (detailed separately in this blog post). Most of these improvements were not possible prior to the project, because the workflows were not aligned.
The cloud continues to grow more complex, with the HashiCorp State of Cloud Strategy Survey reporting that 57% of respondents struggle with a skills shortage. The need for workflows will continue to rise. As part of a new endeavour, we have been talking to a wide variety of companies about their platform needs and solutions. We saw a heavy bias towards building blocks in existing platform solutions, and a need for end-to-end workflows among companies that just want to get their work done. This F5 post also captures the difference between workflows and building blocks (APIs) well. Essentially, would you bank with a bank that gives you a number of tools and multiple steps to deposit a check, or with one that gives you a one-step workflow? For a platform team (or a platform-as-a-product company) to be successful, it needs to focus on defining and shipping end-to-end workflows, not tooling, building blocks, or APIs.