August 16, 2022

Do NOT click-ops your data infrastructure

From spinning up VMs to managing application infrastructure across cloud regions, Infrastructure as Code (IaC) tools have been widely adopted. But what about your data infrastructure? If you're relying on click-ops or complex scripts to build and manage it, read on. In this post, I will discuss the need for an IaC approach to databases and streaming platforms.

Infrastructure - not just your application infrastructure

When we talk about infrastructure, we are typically referring to the application infrastructure. Let’s say you start with your core network (VPCs) and add an abstraction layer on top of that for handling incoming/outgoing traffic (security groups). Then you build and run your virtual machines (VMs). You put a load balancer in front of the VMs, you point the DNS to the load balancer, and finally you might have a content delivery network (CDN) for faster delivery.

But let’s pause for a moment and ask ourselves “Why do we build applications?”

To move data.

Whether it’s your mobile banking application or an application to move data between your enterprise data warehouse and cloud data warehouse, applications move data.

That’s why it's important to understand how we build, manage, and deploy our data infrastructure.

But what is data infrastructure?

Whether you’re using a relational database, a NoSQL database, a streaming platform, or a combination of all of these, you are using datastores. Then there are the networking, observability, and security services that support those datastores. All of these combined are your precious data infrastructure.

Infrastructure as Code (IaC) principles

Before applying IaC principles to data infrastructure, it’s important to know the principles first. 

Reproducibility: The first principle is that systems can be easily reproduced. Your IaC tooling/scripts should handle the build/rebuild of systems, and your engineers should not waste time arguing about how to choose a hostname.

Repeatability: The second principle is that processes are repeatable. Effective infrastructure teams have a strong automation culture. It’s easier to click-ops a one-off request than to write (and test) a script. But what if you have 20 future requests to do the same task? Manual processes are also enemies of consistent systems, as you might not recall the exact configuration you used last time.

Disposability: The next principle states that systems are disposable, which relates to the common expression “cattle, not pets”. In large-scale cloud infrastructure, the failure of underlying hardware is not a matter of if but of when. Software should continue running even if some of the underlying hardware is replaced or modified. Service continuity in the cloud era depends on embracing disposable infrastructure.

Consistency: Consistent systems help you trust your automation. If your test environment and the production environment differ (in terms of vCPU, memory, or network performance), you cannot predict the reliability of your applications. Inconsistent infrastructure leads to configuration drift and vice versa.

Changing design: The final IaC principle is that design is always changing. One way to ensure that a system can be changed safely and quickly is to make frequent changes. The benefit of a dynamic infrastructure is that change is not dreaded; it’s expected.
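To make the consistency principle a bit more concrete, here is a minimal Python sketch of drift detection: compare what your code declares with what is actually running and report every mismatch. The function name `find_drift` and the settings in the example (`vcpu`, `memory_gb`, `pg_version`) are hypothetical and chosen only for illustration; a real IaC tool does this comparison for you against the provider's API.

```python
from __future__ import annotations

from typing import Any


def find_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (desired_value, actual_value)} for every mismatch."""
    missing = object()  # sentinel for a setting that exists on only one side
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, missing)
        have = actual.get(key, missing)
        if want != have:
            # Report a setting that is absent on one side as None.
            drift[key] = (
                None if want is missing else want,
                None if have is missing else have,
            )
    return drift


# Hypothetical specs: what the code declares vs. what is actually running.
declared = {"vcpu": 4, "memory_gb": 16, "pg_version": "14"}
running = {"vcpu": 4, "memory_gb": 32, "pg_version": "14"}

print(find_drift(declared, running))  # {'memory_gb': (16, 32)}
```

An empty result means the running system matches its declaration, which is the state you want both your test and production environments to be in.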

IaC challenges and how to tackle them

There are some valid concerns and challenges around IaC, so let’s take a closer look at them and at how to address them.

  1. Resistance to learning: There’s a tendency to keep a system unchanged if it’s working. Though this practice might have worked for your final-year engineering project, you don’t want to wait until your production database is down to implement better tooling. 
  2. Security issues: According to Bridgecrew’s State of Open Source Terraform Security Report, nearly 1 in 2 public Terraform modules were misconfigured. These misconfigurations can lead to security issues. Public Terraform modules are like any other code you find on the public internet. Just as you wouldn’t run code straight from Stack Overflow without reviewing it, you need to check the configuration and security of public Terraform modules before using them.
  3. Configuration drift and duplication of work: If some engineers still make changes manually or with custom scripts after your organization has adopted IaC, the result is configuration drift and duplication of work. While you are working through a transition plan, some configuration drift and duplication of work is normal. It’s also expected that not every part of the organization will jump on the IaC bandwagon. As long as the same system is not handled by both an IaC tool and manual changes or scripts at the same time, you’ll have a relatively smooth transition.
  4. Handling unicorns: By unicorns, I’m not referring to companies with a certain valuation. Rather, I’m referring to the mythical creature that looks like a horse with a single horn. Databases (relational databases, to be more precise) are the unicorns in the IaC space.

The last point is very tricky, so let’s dive a bit deeper into the questions it raises and how to address them.

How to perform a version upgrade without downtime?

For a major version upgrade, almost every cloud provider requires some downtime for its managed databases. There are mechanisms to minimize downtime, but zero downtime for a major version upgrade is typically not possible. In the end, this is not an argument against IaC, since the downtime would occur whether you clicked a button in the UI, ran a script, or used an SDK to perform the version upgrade.

How to prevent an accidental version upgrade?

Some IaC tools provide a flag like “allow_major_version_upgrade”. The service can perform a major version upgrade only if this flag is set to “true”. This is a simple way to prevent a sudden version upgrade.
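The logic behind such a flag can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular IaC tool implements it: the names `major_of`, `check_upgrade`, and `MajorUpgradeBlocked` are hypothetical, and the sketch assumes version strings whose first dot-separated component is the major version (for example "14.5").

```python
class MajorUpgradeBlocked(Exception):
    """Raised when a major version upgrade is requested without the opt-in flag."""


def major_of(version: str) -> int:
    """Extract the major component, assuming the form "MAJOR.MINOR" (e.g. "14.5")."""
    return int(version.split(".")[0])


def check_upgrade(current: str, target: str, allow_major_version_upgrade: bool = False) -> str:
    """Return the target version if the change is allowed, otherwise raise."""
    if major_of(target) > major_of(current) and not allow_major_version_upgrade:
        raise MajorUpgradeBlocked(
            f"{current} -> {target} is a major upgrade; "
            "set allow_major_version_upgrade=True to proceed"
        )
    return target


# A minor upgrade passes without the flag.
print(check_upgrade("14.5", "14.7"))

# A major upgrade only passes when someone explicitly opts in.
print(check_upgrade("14.5", "15.1", allow_major_version_upgrade=True))
```

Because the flag defaults to false, a major upgrade has to be a deliberate, reviewable change in version control rather than a side effect of some other edit.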

How to perform configuration changes without downtime?

Configuration changes can be as simple as adding a tag or as complex as changing encryption settings. For major configuration changes like changing the instance type, subnet, or encryption settings, there might be some downtime. For those changes, it’s ideal to perform the changes during a maintenance cycle and inform users in advance. 
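One way to picture this is a small Python sketch that splits a set of pending changes into those that can be applied right away and those that should wait for the maintenance window. The set of disruptive settings below simply mirrors the examples in the paragraph above (instance type, subnet, encryption settings); the setting names and the function `split_changes` are hypothetical, and which changes actually cause downtime depends on your provider and service.

```python
from __future__ import annotations

from typing import Any

# Assumption: these settings may cause downtime when changed (taken from the examples above).
DISRUPTIVE_SETTINGS = {"instance_type", "subnet", "encryption"}


def split_changes(changes: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split changes into (apply_now, defer_to_maintenance_window)."""
    apply_now = {k: v for k, v in changes.items() if k not in DISRUPTIVE_SETTINGS}
    deferred = {k: v for k, v in changes.items() if k in DISRUPTIVE_SETTINGS}
    return apply_now, deferred


# Hypothetical change set: a tag update plus an instance type change.
pending = {"tags": {"team": "data"}, "instance_type": "large"}
apply_now, deferred = split_changes(pending)

print(apply_now)  # {'tags': {'team': 'data'}}
print(deferred)   # {'instance_type': 'large'}
```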

A solution - Terraform

Terraform is an open-source IaC tool that you can use to build and rebuild your on-prem or cloud resources. As with your application code, you can keep Terraform code under version control, which is good for auditing purposes.

Notice that the section title says “a solution” and not “the solution”. I recommend Terraform as an IaC tool because it's mature, easy to use, and open-source. Feel free to choose any IaC tool that fits your needs as long as it follows the general IaC principles.

Terraform talks to any available target API via providers. For example, the Aiven Terraform Provider lets you create Aiven resources (Aiven for Apache Kafka®, Aiven for PostgreSQL®, and a number of other data-related services) on the cloud of your choice. 

If this article piqued your interest, check out the Aiven Terraform cookbook to build and manage your data infrastructure using IaC principles. If you have any questions, please reach out.

* Click-ops: An ops task that is accomplished by clicking through a GUI rather than through automation.

** Parts of this blog post were originally published on the author’s personal blogging site.
