Lessons learned from 100s of Infrastructure as Code (IaC) setups

Sören Martius
CEO @ Terramate

If you're organizationally invested in Infrastructure as Code, then you've probably heard of Terraform. As the most widely adopted IaC tool out there, it's the default choice for established enterprises and startups alike. 

You may have also noticed, however, that configuring Terraform properly demands a lot of work, especially when it comes to using it at scale. How can enterprises and organizations get past the initial rough patches, dodge the pitfalls, and chart a path toward Terraform setups that take growth in stride?

We wondered the same thing, so we asked Sören Martius, one of the founders and CEO of Mineiros, to share the insights he's gained by delving into hundreds of IaC setups in a quest to help companies and startups scale. Here's a recap of what he had to say.

The realities of IaC

Sören started with a quick refresher on what IaC entails and how Terraform can make it easier. For the uninitiated, Infrastructure as Code revolves around the use of configuration files to handle all things to do with infrastructure. You can build, change, and manage infrastructure in a safe, consistent, and repeatable way by defining versioned, reusable, and shareable resource configs.

  • Reusability and shareability: IaC lets you reuse configurations among several environments, gradually enhancing and improving them – or even building on existing code. 
  • Versioning: Storing configurations in a VCS makes development easier by offering a full history of changes you can roll back as needed. This works best when you write comprehensive commits and structure your changes in an organized way.
  • Speed: IaC promotes machine-executable automation for faster DevOps cycles.
  • Reliability: With IaC, you can ensure integrity through automated tests, code reviews, and static code analysis.
  • Documentation: Sören said that well-written code can be self-documenting, although he also advised adding comprehensive supplemental docs that describe what you're trying to achieve with practices like automation.

IaC is also way better than the alternative – what Sören referred to as "ClickOps" – where you manage infrastructure through a GUI. This practice is slow and prone to errors that only accumulate as environments gradually diverge. 

ClickOps practices typically lack versioning, eliminating any hope of clean audit trails. Since you can't reuse configs, it becomes impossible to roll them out to multiple environments. 

One of ClickOps' biggest weaknesses is that it's highly dependent on individuals. If the knowledgeable engineers who were in charge of configs jump ship, your infrastructure will be dead in the water until you can decipher the configurations they left behind.

So why do teams lean on Terraform for IaC? According to Sören, it has a lot to do with the tool's maturity and standardized workflows. Terraform's declarative configuration language, HashiCorp Configuration Language (HCL), gives you most of what you need to consistently "provision and manage all of your infrastructure throughout its lifecycle."

Sören pointed out that this last quote is merely the ideal: Terraform promotes IaC practices, but it doesn’t handle enterprise-level scaling as well as one might hope. As such, you're likely to run into several problems when you try to take your config management show to the big time.

Splitting code from state: structuring Terraform code for fluid scalability

Growing Terraform code bases can be problematic when it comes to optimizing execution times: CI/CD pipelines have to download numerous dependencies, run integrity checks, and compare your current configuration to your deployments in the cloud. In other words, run time grows with the number of resources you manage!

Companies that want to work at scale should consider how they can split their Terraform code and state into isolated units. Terraform facilitates a module pattern that allows you to structure related resources in reusable units, but that's only half the battle: Terraform modules neither provide isolation nor incorporate state. While they certainly help with consistency and organization, they don't prevent long run times.

To overcome these challenges, the Terraform community introduced the concept of stacks – runnable, stateful modules that operate on subsets of resources and incorporate independent provider configurations.
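As a rough illustration of the idea, the sketch below shows two stacks as separate root modules, each with its own state and provider configuration. The directory paths, bucket name, region, and resources are hypothetical and not taken from the talk; the S3 backend is just one possible choice.

```hcl
# stacks/networking/main.tf (hypothetical path)
terraform {
  backend "s3" {
    bucket = "example-terraform-state"             # hypothetical bucket name
    key    = "stacks/networking/terraform.tfstate" # state for this stack only
    region = "eu-central-1"
  }
}

provider "aws" {
  region = "eu-central-1"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# stacks/dns/main.tf (hypothetical path, a second and independent stack)
terraform {
  backend "s3" {
    bucket = "example-terraform-state"
    key    = "stacks/dns/terraform.tfstate" # a different key, so a separate state
    region = "eu-central-1"
  }
}

provider "aws" {
  region = "eu-central-1"
}

resource "aws_route53_zone" "main" {
  name = "example.com"
}
```

Because each directory is planned and applied on its own, a change to the DNS stack only refreshes the resources tracked in that stack's state, which keeps run times from growing with the size of the whole code base.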

Sören recommended Mineiros' open-source Terramate tool as one viable option for managing multiple Terraform stacks. This tool makes it easier to adhere to common best-practice principles, like keeping code DRY, ensuring you generate valid code, automating change detection, defining specific stack execution orders, and running stacks in groups. 

According to Sören, alternatives like Terragrunt and Terraspace also work to provide similar benefits that can help you thrive at scale. Terramate’s big differentiator is its focus on functioning natively without requiring integration tests.

A layered approach to testing

Another piece of good advice from this talk involved cultivating an awareness of three different layers of tests: code reviews, static code analysis, and automated tests.

Sören called code reviews the first guardrail you should have in place when handling infrastructure: You’ll benefit greatly from establishing a review plan before executing. Always strive to give your teams the visibility they need to comprehend what's happening in CI runs, whether that entails clear GitHub PRs or descriptive diagrams for those who aren't so HCL-savvy. 

The next testing guardrail, static code analysis, helps you detect and avoid misconfigurations, vulnerabilities, and threats. For instance, routine analysis can help you uncover exposed secrets. 

Sören noted that companies can adopt good code analysis practices without needing to do everything manually. There are numerous commercial tools and solutions to help you get started. He also recommended that engineering teams use pre-commit hooks to format, validate, and lint code before pushing it to a repo, detecting misconfigurations in advance to save build time.

Finally, there's automated testing, which incorporates both unit and integration testing. Similar to their parallels in non-infrastructure code testing, unit tests are for confirming the correctness of your Terraform plan's outputs. Integration testing involves deploying real infrastructure to validate that you can reach target endpoints. 

This distinction brings up an important point: Integration tests take time and money, and in some cases, they'll force you to confront edge cases, like when you can't clean up a deployed test infrastructure. Sören's advice for companies was to use automated processes to boost the observability of cloud environment maintenance – something Terramate and other tools can help you achieve.
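For the plan-level unit tests mentioned above, one option (not necessarily the one Sören had in mind) is Terraform's built-in test framework, which assumes Terraform 1.6 or later. The sketch below assumes a module that declares a `bucket_name` variable and an `aws_s3_bucket.logs` resource; both names are hypothetical, and the provider still needs to be configured for the plan to run.

```hcl
# tests/bucket.tftest.hcl (hypothetical file; run with `terraform test`)
run "bucket_uses_requested_name" {
  # Only create a plan, so no real infrastructure is deployed
  command = plan

  variables {
    bucket_name = "example-logs" # hypothetical input variable of the module
  }

  assert {
    # Check a planned attribute against the expected value
    condition     = aws_s3_bucket.logs.bucket == "example-logs"
    error_message = "The planned bucket name did not match the requested name."
  }
}
```

Switching `command` to `apply` would turn this into an integration-style test that deploys and then destroys real resources, with the time and cost trade-offs described above.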

Recognizing common hazards 

Sören's talk also covered a couple of things you definitely shouldn't do if you want your Terraform-backed IaC practices to scale effectively:

Branches ≠ good environment management

One potential pitfall involves using branches to manage environments. Although this seems like a good idea, it doesn't scale, so it's best to build on the natural organization structure provided by your VCS flows.

Your main branch ought to be the sole source of truth, and you should strive to represent your entire infrastructure using a single uniform hierarchy. 

Keeping these concepts in mind as you develop is particularly helpful when it comes to navigating and iterating through code base versions. Avoiding the trap of relying solely on branches also makes it easier to integrate different (or multiple) backends, share data, and promote concurrency.

Pin exact versions

A lot of companies think they can make IaC simpler by using dependency version ranges, but this often does the exact opposite. Failing to adhere to strict semantic versioning raises the odds that contributors might introduce breaking changes – even when pushing seemingly minor patches – which can ultimately lead to infrastructure failures or irrecoverable destruction. 

The solution here is to always pin exact versions when it comes to modules and providers. Sure, you can update minor versions automatically, but at the end of the day, you’ll benefit from being explicit about your dependencies: The risks of careless versioning just aren't worth it.
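A minimal sketch of exact pinning for the Terraform CLI, a provider, and a registry module might look like the following. The version numbers are illustrative only, so check the registry for the releases you actually want.

```hcl
terraform {
  # Exact CLI version instead of a constraint such as ">= 1.5"
  required_version = "1.5.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.31.0" # exact pin, rather than a range such as "~> 5.0"
    }
  }
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.1" # exact module release, bumped deliberately in a reviewed change

  name = "example"
  cidr = "10.0.0.0/16"
}
```

With exact pins, every upgrade shows up as an explicit diff in a pull request, so a breaking change in a dependency cannot slip into a run unnoticed.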

Design patterns made with smoother operations in mind

Sören wrapped up by stressing the goal of empowering your delivery teams to self-service whatever kind of infrastructure they need and avoid lengthy, ops-centric provisioning processes that are prime grounds for bottlenecks. He pointed to two important patterns that can make Terraform easier to wrangle at scale: landing zones and service catalogs.

Landing zones

The idea behind the landing zone concept is to provide a basic framework for end users to consume. Landing zones let users self-service deploy environments you've previously configured to satisfy your organizational security policies and compliance guidelines. 

The landing zone design pattern brings numerous advantages to Terraform workflows, including:

  • Enabling faster delivery and self-servicing of provisioned environments for applications,
  • Preventing vulnerabilities by automating misconfiguration detection and mitigation, which also limits the fallout when problems occur, 
  • Reducing costs by centralizing oversight, such as by restricting spending in sandbox environments,
  • Increasing the efficiency of onboarding and offboarding teams and individuals while maintaining operational continuity, and
  • Optimizing security by sticking to the least-privilege principle, making it easier to comply with privacy and data-handling regulations.

Service catalogs

Sören said that while Terraform is a good technology for defining reusable service deployment patterns, you shouldn't stop there. Instead, create company-wide or team-specific service catalogs that enable your engineering teams to deploy services complete with the necessary backing infrastructure. 

Service catalogs can simplify deployments that involve multiple moving parts, like a service from Kubernetes that might require a database deployed on AWS, Google Cloud, or even a multi-cloud environment. They also make it easier to set standards, since you can create reusable patterns that you've vetted in advance to meet your scalability, compliance, and availability requirements. 

Service catalogs' power lies in how they abstract complexity away from developers and promote easy-to-use APIs. Conveniently, you can build catalogs on top of open-source base modules and even bundle multiple modules and resources together – without your users needing to master every last intricacy of the underlying services you've packaged together. 
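To make the abstraction concrete, here is a sketch of what a catalog entry and its consumer could look like. Everything in it is hypothetical: the `web-service` module, its repository URL, and its inputs are invented for illustration and do not refer to a real catalog.

```hcl
# service-catalog/web-service/variables.tf (hypothetical catalog module)
# The catalog exposes a small, validated interface and hides the
# underlying database, networking, and IAM resources from its users.
variable "name" {
  type        = string
  description = "Name of the service to deploy."
}

variable "database_size" {
  type        = string
  description = "T-shirt size that the module maps to a vetted database configuration."
  default     = "small"

  validation {
    condition     = contains(["small", "medium", "large"], var.database_size)
    error_message = "database_size must be one of: small, medium, large."
  }
}

# In a delivery team's own configuration (hypothetical consumer)
module "checkout_service" {
  # Pinned to an exact tag of the hypothetical catalog repository
  source = "git::https://example.com/platform/service-catalog.git//web-service?ref=v1.4.0"

  name          = "checkout"
  database_size = "medium"
}
```

The consuming team only picks a name and a size, while the platform team decides once, inside the module, which resources and settings satisfy the organization's standards.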

The last word on Terraform at scale

Sören wrapped up by leaving us with a powerful reminder: Terraform is just a technology, and you really ought to be thinking of IaC in terms of workflows. In the end, the tools in your ecosystem should be the means, not the end – your goal is to enable your customers, users, and developers to work efficiently and effectively.

This was a great talk that also included a 20-minute Q-and-A session. Sören also gave numerous examples of how real-world organizations use Terraform – for better or worse. For the full details, check out the talk here.