Platform engineering

How Palantir built their GitOps Internal Developer Platform

Greg DeArment
Head of Production Infrastructure and the Apollo Platform @ Palantir

Palantir builds unique software that enables data-driven operations and decision-making, typically for governments and intelligence agencies. Although the company initially picked GitOps for its developer-friendly workflows, the approach wound up being significantly harder to scale than expected.

Palantir's answer was to build a better Internal Developer Platform for scaling GitOps. Of course, this journey took some work.

For more than five years starting in 2017, Greg DeArment, the company's Head of Production Infrastructure and the Apollo Platform, had a first-hand view as Palantir refined its Apollo platform in support of Continuous Deployment (CD). As Greg shared in this talk, however, the transition wasn't always smooth. Here’s the full story of how his team had to regroup multiple times to reach its goal.

Palantir’s Apollo Platform

Palantir offers its clients multiple products, including Foundry and Gotham. Although these products are distinct, they're both powered by the same underlying DevOps platform that Palantir built in-house: Apollo.

The company rolled out Apollo around 2017, marking its first foray into the production use of GitOps – though it had already begun investing in platform engineering and infrastructure as early as 2015.

Today, the Apollo platform serves hundreds of environments and services and supports more than 1,000 engineers working in four- to twelve-person teams. According to Greg, it underpins everything about how Palantir builds, deploys, and operates software in production.

Apollo can be found in hyper-scale cloud environments, on-premises environments for governments, and increasingly in edge environments. Greg said that all told, the platform typically saw thousands of deployments and configuration changes daily. So how did his team handle that much GitOps work?

How GitOps scaled at Palantir

Greg noted the roles ownership and organization played in the platform's ups and downs. The company divided its Apollo workforce into two main teams:

  • Greg's production infrastructure team included about 110 software engineers, DevOps engineers, and operators. These stakeholders were responsible for everything the company used to deploy and operate its customer-facing offerings, including the cloud infrastructure and the Apollo platform itself.
  • The internal infrastructure team comprised fewer than 20 engineers responsible for building first-party software. They also handled third-party software, such as GitHub, CircleCI, and Salesforce – basically, everything Palantir needed to operate as a modern company.

When it came to putting these formidable capabilities to work, Greg said Palantir had three main use cases for GitOps. He also posed a few questions that might help you articulate the why behind your own mission as you try to promote more GitOps-friendly platform engineering workflows:

  • Continuous Deployment service management: How would Palantir use the Apollo platform to deploy and manage the services that made up its customer-facing platforms?
  • GitOps for infrastructure: How could Palantir use GitOps to optimize its management of cloud infrastructure? When relying on third-party tools, how could the company reach a state where systems like Kubernetes actually fulfilled their intended purpose of running software?
  • Internal infrastructure: How could Palantir use GitOps to manage its developer tools and ecosystem?

Most of Greg's talk focused on the first use case. He also dropped some useful insights into the other two goals during the session – and discussed how success didn't come all at once. 

Since Greg arrived at Palantir around 2017, Apollo adoption has grown linearly, reaching more than 10,000 monthly pull requests by 2021! Despite the apparent success, Greg's team was fighting a constant battle to keep PR merge times under control.

Though Apollo had started as a massive accelerant for the company, it eventually became a burden. Greg noted that although the platform automated a lot of tasks, like opening pull requests for every environment Foundry was running in, these automated requests took significantly longer to merge than those opened by individuals. In some cases, a request could take a whopping 140 hours to resolve!
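To make that dynamic concrete, here's a minimal Go sketch of the kind of fan-out such automation implies. Everything in it – the repo naming, the openPR stub, the data model – is a hypothetical illustration rather than Apollo's actual code; the point is simply how one version bump, multiplied across environments, becomes a pile of PRs all waiting on human review:

```go
package main

import "fmt"

// Release describes a new service version that automation would roll out.
// The names and fields here are hypothetical, not Apollo's actual data model.
type Release struct {
	Service string
	Version string
}

// openPR stands in for a real source-control API call (e.g., creating a
// GitHub pull request). A production system would hit the VCS API here.
func openPR(repo, title string) {
	fmt.Printf("opened PR in %s: %s\n", repo, title)
}

func main() {
	// One config repo per environment (hypothetical naming).
	environments := []string{"env-eu-1", "env-us-1", "env-gov-1"}
	release := Release{Service: "search-service", Version: "2.14.0"}

	// One release fans out into one pull request per environment repo.
	// With hundreds of environments, a single version bump becomes
	// hundreds of PRs, each waiting on a human to review and merge.
	for _, env := range environments {
		repo := fmt.Sprintf("deploy-config-%s", env)
		title := fmt.Sprintf("Bump %s to %s", release.Service, release.Version)
		openPR(repo, title)
	}
}
```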

Searching for a problem-solving strategy

Something had to change, so in 2019, the team made some concerted investments geared toward lowering the number of PRs, such as restructuring repos to promote easier management. 

While this strategy worked for a while, problems kept creeping back into the picture, requiring new solutions and fixes. Eventually, the company realized that its existing piecemeal approach wasn't worth the trouble.

While internal users' opinions on Apollo varied, Greg observed that their dissatisfaction typically scaled right along with their usage! Notable sticking points included the following:

  • Apollo's use of Git and GitHub made it tough to improve the user experience,
  • Adding new services required hundreds of PRs,
  • There was a steep learning curve associated with editing YAML files across repos, and
  • The complexity of file-based permissioning made workflows inefficient.

Discrepancies between Git and production weren't easy to resolve, either. Try as they might, the teams tasked with debugging these issues were routinely held up because they had to go through the teams that owned the GitOps system:

  • It was hard to understand the intermediate states that arose when changes were merged to Git but not yet picked up by Apollo, whether because its many asynchronous systems failed to coordinate or because pushes were intentionally blocked.
  • User experiences and workflows were suboptimal because Git wasn't the only factor determining whether changes got applied. Because teams could implement specific maintenance and suppression rules, devs had to consult multiple information streams to figure out what was going on when changes failed – the sketch after this list illustrates the kind of status resolution this required.
  • The Apollo engineering team constantly faced organizational bottlenecks, including a steady stream of questions and support tickets about issues that weren't really theirs to solve.
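Here's a rough Go sketch of that status resolution. The gates it models – maintenance windows, suppression rules, apply status – are assumptions based on Greg's description, not Apollo's real API, but they show why "is my change live?" had no single answer:

```go
package main

import "fmt"

// ChangeStatus aggregates the separate signals a developer previously had
// to check by hand. All names here are illustrative, not Apollo's API.
type ChangeStatus struct {
	MergedToGit      bool
	InMaintenance    bool // environment is under a maintenance window
	Suppressed       bool // a team-defined suppression rule is blocking rollout
	AppliedToCluster bool
}

// Explain resolves the effective state of a change from all of the gates
// that sit between "merged" and "running", and says why it is pending.
func Explain(s ChangeStatus) string {
	switch {
	case !s.MergedToGit:
		return "pending: change not yet merged to Git"
	case s.InMaintenance:
		return "pending: environment is in a maintenance window"
	case s.Suppressed:
		return "pending: a suppression rule is blocking this rollout"
	case !s.AppliedToCluster:
		return "pending: merged, but the deployment system has not applied it yet"
	default:
		return "applied: Git and production agree"
	}
}

func main() {
	fmt.Println(Explain(ChangeStatus{MergedToGit: true, Suppressed: true}))
	fmt.Println(Explain(ChangeStatus{MergedToGit: true, AppliedToCluster: true}))
}
```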

One of Greg's most powerful takeaways concerned relying on dev tools to power production GitOps. Palantir could only respond to incidents if these systems were available and reliable:

  • Although the developer-tools-centric strategy did let the team get moving right away thanks to its familiarity with the tooling, it became a source of panic every time regularly scheduled maintenance or other interruptions occurred. 
  • Palantir intentionally keeps its internal infrastructure and customer-facing environments separate in terms of security and identity. As you might expect, however, relying on internal systems to power the customer-facing environments made this goal much harder to achieve.
  • Apollo's internal infrastructure teams owned one set of the systems, and the production infrastructure teams owned a different piece. As those teams were in different parts of the organization, they had distinct cultures, priorities, and expectations. On more than one occasion, this led to unexpected friction that was hard to resolve. 

Many of these challenges were undoubtedly related to the way Apollo did GitOps. Still, they're worth considering even if you're not using developer tools to power the bulk of your platform, particularly when it comes to understanding how different stakeholder needs might clash and make your platform less successful than you intended.

How the Apollo team rethought its approach to GitOps to help Palantir operate at scale

For about three years, Greg and his team slogged along, making incremental progress in improving their GitOps workflow user experience. Then in 2020, they decided to pursue more sustainable gains by rethinking their entire approach to GitOps with a focus on four broad goals:

1. Adhering to the GitOps principles while refining the implementation

Having a declarative, single source of truth with an auditable history and change management controls was a definite plus, but the team knew its old tooling was unsustainable. The new implementation would target the same functional features without tying Apollo to existing source control systems or tools.

2. Giving users a more tailored user experience for common GitOps workflows

Apollo engineers decided that users shouldn't have to interact with YAML files when it didn't make sense to do so. Instead, they'd aim for an intent-based, outcome-oriented workflow that let people deal with changes at the team level instead of on a per-repo basis, permitting organization-wide changes with less manual intervention.
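As a rough illustration of what intent-based might mean, consider the Go sketch below. The Intent type, the catalog lookup, and the expansion logic are all hypothetical stand-ins rather than Apollo's actual design; what matters is that the user states one team-level goal and the platform derives the per-environment edits:

```go
package main

import "fmt"

// Intent captures what a user wants at the team level ("upgrade my
// service everywhere my team owns"), rather than which files to edit.
type Intent struct {
	Team    string
	Service string
	Version string
}

// environmentsOwnedBy stands in for a lookup against a service catalog.
func environmentsOwnedBy(team string) []string {
	return map[string][]string{
		"search-team": {"env-eu-1", "env-us-1", "env-gov-1"},
	}[team]
}

// Expand turns one team-level intent into the concrete per-environment
// changes the platform applies, so users never touch per-repo YAML.
func Expand(i Intent) []string {
	var changes []string
	for _, env := range environmentsOwnedBy(i.Team) {
		changes = append(changes,
			fmt.Sprintf("%s: set %s -> %s", env, i.Service, i.Version))
	}
	return changes
}

func main() {
	intent := Intent{Team: "search-team", Service: "search-service", Version: "2.14.0"}
	for _, c := range Expand(intent) {
		fmt.Println(c)
	}
}
```

Running it prints one derived change per environment the team owns – the same outcome as hand-editing three repos, minus the YAML.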

3. Furnishing a better model for permissions and approvals

Instead of aligning permissions and approvals with the files being changed, the team chose to organize them around roles and responsibilities. Approval decisions followed the same logic: they would now be routed based on the contents of a change rather than its location.
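A toy Go sketch of that routing idea might look like the following. The change categories and role names are invented for illustration, but they capture the contrast with file-path-based schemes such as GitHub's CODEOWNERS:

```go
package main

import "fmt"

// ChangeKind classifies a change by what it does, not where it lives.
// These categories and the role mapping are illustrative assumptions.
type ChangeKind string

const (
	ServiceVersionBump ChangeKind = "service-version-bump"
	InfraChange        ChangeKind = "infrastructure-change"
	PermissionChange   ChangeKind = "permission-change"
)

// approverRole routes a change to a role based on its contents. Contrast
// this with file-based schemes, where approval rights follow file paths
// rather than responsibilities.
func approverRole(kind ChangeKind) string {
	switch kind {
	case ServiceVersionBump:
		return "service-owner"
	case InfraChange:
		return "infrastructure-engineer"
	case PermissionChange:
		return "security-reviewer"
	default:
		return "platform-admin"
	}
}

func main() {
	for _, k := range []ChangeKind{ServiceVersionBump, InfraChange, PermissionChange} {
		fmt.Printf("%s -> needs approval from: %s\n", k, approverRole(k))
	}
}
```

The payoff is that approval rights follow responsibilities, so restructuring repos no longer means re-plumbing permissions.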

4. Providing clarity between target and current states

The team decided users should know when and why changes were pending. Leadership also wanted users to be able to view changes from a single vantage point instead of having to hunt through hundreds of different repos or become experts in Palantir's sprawling GitHub repo structure.
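Boiled down, that single viewpoint might look something like this hedged Go sketch, with hard-coded data standing in for the deployment system's real APIs:

```go
package main

import "fmt"

// EnvState pairs the Git-declared target with what is actually running,
// plus a reason when they differ. This is a hypothetical sketch of the
// "single viewpoint" idea, not Apollo's real model.
type EnvState struct {
	Environment string
	Target      string
	Current     string
	Reason      string // why the change is pending, if it is
}

func main() {
	// In a real platform this data would come from the deployment
	// system's APIs; it is hard-coded here to keep the sketch runnable.
	states := []EnvState{
		{"env-eu-1", "2.14.0", "2.14.0", ""},
		{"env-us-1", "2.14.0", "2.13.2", "maintenance window until 02:00 UTC"},
		{"env-gov-1", "2.14.0", "2.13.2", "awaiting security-reviewer approval"},
	}

	// One view answers "is my change live, and if not, why?" without
	// hunting through hundreds of repos.
	for _, s := range states {
		if s.Target == s.Current {
			fmt.Printf("%-10s up to date (%s)\n", s.Environment, s.Current)
		} else {
			fmt.Printf("%-10s pending %s -> %s: %s\n", s.Environment, s.Current, s.Target, s.Reason)
		}
	}
}
```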

The team achieved some of these improvements by redesigning existing Apollo features. For instance, modern versions of the platform include dashboards and forms that let users make changes to repos based on their intent and higher-level goals – instead of diving into the technical specifics or editing YAML files!

Greg's key takeaways

Greg wrapped up by discussing some of the broader lessons he'd gleaned from being in the driver's seat during Palantir's quest for GitOps success. For one thing, he said that platforms serving many types of users have a completely different GitOps experience than small teams do: as complexity increased in terms of the number of products and teams involved, so did the operational overhead of GitHub-powered GitOps.

Greg also cautioned that your growth might not be as tidy and predictable as you expect. The tools that get you started with GitOps aren't necessarily going to serve you equally well when it's time to scale. 

The same goes for your approach to platform ownership and team structure: If you’re not careful, the same DevOps team that owns your GitOps infrastructure may end up becoming the bottleneck itself.

Greg's talk was a great insider look at how a massive organization can leverage GitOps and smart platform-engineering practices. Want to hear more tips on what to expect from your GitOps journey and Greg's informative responses to questions from the audience? Watch the meetup video.