
Why observability and developer self-service are key trends in DevOps

Liz Fong-Jones
Principal Developer Advocate @ Honeycomb.io

With the massive growth in popularity of cloud computing and cloud-native services, more applications than ever rely on microservice architectures. This approach is quickly becoming the modern standard for developing web applications, which raises the question: how do we effectively monitor these applications?

Observability ensures that we can effectively and efficiently debug applications, which is why it needs to be a key consideration in the development cycle, particularly in production environments. Yet, as the number of environments and services used by any particular engineering team increases, observability becomes increasingly complex.

Liz Fong-Jones joined us for a webinar to discuss the challenges developers face, why microservice architectures need observability, and to answer one of the biggest questions in modern DevOps: how can DevOps teams provide engineers with observability data while hiding the underlying complexity?

Challenges facing developers

As the modern software industry puts ever greater emphasis on shipping quality software faster, the ability to keep up with this demand ultimately determines an organization’s success. The more time developers spend fixing bugs after release, the less time they have to work on new features or new products.

“Slightly less than half of a developer’s time is spent on unplanned work to fix code breakages rather than advancing the state of the business. We spend 17 hours per developer-week on things that don’t go according to plan, which is a giant waste of capital and human potential,” explains Liz. 

With that in mind, if developers can understand issues in code and debug them more quickly, they’ll ship higher-quality software and features faster. In turn, they’ll be able to scale applications to meet demand while decreasing downtime. However, this presents additional problems, particularly in environments where developers throw code over the fence once it’s ready for deployment.

“The reality is that passing code over the fence doesn’t make things any easier for developers. If the next team has a problem with the code and can’t figure out how to fix it, they ask the developer, but by this point it’s been months and the developer can’t remember what they wrote,” Liz says. So, solving bugs and fixing bad code takes far longer than if the developer could understand the issues before the code was passed on. 

Microservice applications create complex interactions between services, and combined with unpredictable user intent, this means developers can rarely guard against everything that can go wrong. It’s also rare for an application to have only one user base, so end users often exercise applications in ways developers could never anticipate or catch in staging.

Unfortunately, monitoring alone is no longer effective, simply because it shows developers when something is wrong but offers no insight into why it’s wrong. As a result, organizations need a new way of thinking. This is where observability comes in.

Why observability is vital 

At the core of debugging modern microservices is the ability to understand what causes a particular issue. Observability can give developers this understanding. It allows them to understand the internal state of an application by analyzing its external outputs. Often the first thing developers think of when it comes to observability is data. 

However, this is only one part of the equation. As Liz puts it, “When we think about observability, we think a lot about data. But how we gather the data is only a subset of the problem. We need to think about what are we using the data for and then figure out what our data strategy is, so we need to be able to understand what's going on with our code.”

Based on this, observability requires instrumentation code that produces data that can then be queried to provide this understanding. Unfortunately, the industry tends to focus only on the data itself, not on how that data came to be or what steps were taken for those data points to exist.

So, while developers need both metrics (aggregated summary statistics) and logs (detailed debugging information emitted by individual processes), they also need access to full traces of requests to pinpoint failures and performance issues.
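To make the distinction concrete, here is a minimal sketch using the OpenTelemetry Python API, contrasting a metric (one counter aggregated across all requests) with a trace span (the detailed record of a single request). The service, span, and attribute names are invented for illustration, and a configured MeterProvider and TracerProvider are assumed.

```python
from opentelemetry import metrics, trace

# Illustrative names; assumes a MeterProvider and TracerProvider are
# already configured (otherwise these calls are no-ops).
meter = metrics.get_meter("checkout-service")
tracer = trace.get_tracer("checkout-service")

# A metric: a single counter aggregated across all requests.
request_counter = meter.create_counter(
    "app.requests", description="Total requests handled"
)

# A trace span: the detailed record of one individual request.
with tracer.start_as_current_span("handle-request"):
    request_counter.add(1, {"http.route": "/checkout"})
```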

“The thing that ties data together is context propagation and tracing,” explains Liz. “A request goes through so many microservices, you need to be able to trace that request and derive statistics out of it.” 
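To illustrate what context propagation looks like in code (our example, not one from the webinar), here is a minimal sketch with the OpenTelemetry Python API: one service injects the W3C traceparent header into its outgoing request headers, and the next service extracts it so both spans join the same trace. Service and span names are invented, and a configured TracerProvider is assumed.

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

# Invented service name; assumes a TracerProvider is configured.
tracer = trace.get_tracer("checkout-service")

# Service A: open a span and inject its context into outgoing headers.
with tracer.start_as_current_span("checkout"):
    headers = {}
    inject(headers)  # adds a W3C 'traceparent' header to the dict
    # ... send the HTTP request to service B with these headers ...

# Service B: extract the incoming context so its span joins the same trace.
ctx = extract(headers)
with tracer.start_as_current_span("charge-card", context=ctx):
    ...  # recorded as a child of "checkout" in the same trace
```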

Developers don’t just need a way to generate this data, however. They also need a place to store it and a way to easily visualize it so it can be put to use. While vendors such as Datadog and Sentry offer software development kits (SDKs) for this, these can be restrictive for teams that don’t want to tie themselves to a vendor-specific SDK. That’s where OpenTelemetry comes in.

What is OpenTelemetry?

OpenTelemetry is a vendor-neutral, open-source framework that supports tracing, context propagation, and metrics. It allows developers to instrument, generate, collect, and export the metrics, logs, and traces that provide deeper insight into their software’s behavior and performance, which in turn gives them a better understanding of what causes a particular problem.

OpenTelemetry contains specifications for use across various languages, with definitions for data types and operations, and requirements for language-specific implementation, configuration, and data processing. It also provides the tools to collect, transform, and export telemetry data and send it to whatever backend developers are currently using. Its SDKs can also automatically generate telemetry data from popular libraries and frameworks.
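As a rough sketch of that instrument-generate-collect-export pipeline in Python (one of the supported languages), the example below wires the SDK’s tracer provider to a span processor and an exporter, then emits a single span. The exporter here just prints to stdout, and the span and attribute values are made up for illustration.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Collect spans in batches and export them (here, to the console).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo-service")  # illustrative name

# Generate one span describing a unit of work.
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.request.method", "GET")  # illustrative attribute
```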

“The point of OpenTelemetry is to solve the problem of instrumentation and data implementation, but not the actual storage of data at rest and querying. It gives you the freedom and flexibility to choose your backend and change backends without ripping out all of your code so you can solve all of these problems no matter what scale you’re working at,” Liz says. 

Getting started with OpenTelemetry

The first step is to start with automatic instrumentation. Before making anything self-service, though, it should be trialed first: “Start with auto-instrumentation and before you make anything self-service, try it out yourself,” Liz advises.
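As one example of what auto-instrumentation can look like in Python, the sketch below uses the opentelemetry-instrumentation-requests package to patch the requests library so every outgoing HTTP call emits a span, without any changes to application code. It assumes that package is installed and a tracer provider is configured as in the earlier sketch.

```python
import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Assumes opentelemetry-instrumentation-requests is installed and a
# TracerProvider is configured as in the earlier sketch.
RequestsInstrumentor().instrument()

# Every outgoing HTTP call now emits a span with method, URL, and
# status code, without touching application code.
requests.get("https://example.com")
```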

Once collected, this data can then be sent to a telemetry backend. There are multiple options available, so organizations should try out different providers to determine which one suits their specific business needs.
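Whichever backend is chosen, the wiring usually looks much the same. Here is a hedged sketch that swaps the console exporter for an OTLP exporter; the endpoint shown is the local OTLP gRPC default and is purely illustrative.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Assumes the opentelemetry-exporter-otlp package is installed; the
# endpoint is the local OTLP gRPC default and is purely illustrative.
exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```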

However, for the deepest insights into code, manual instrumentation based on specific, relevant business needs provides the most value. For example, adding custom attributes relevant to a particular business, or creating spans that cover smaller units of work, can significantly increase the value of the insights.
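A short sketch of what that manual step might look like, continuing the Python examples above; the attribute names and span boundaries are invented for illustration.

```python
from opentelemetry import trace

# Invented service name; assumes a TracerProvider is configured.
tracer = trace.get_tracer("billing-service")

with tracer.start_as_current_span("process-invoice") as span:
    # Custom attributes make traces queryable by business dimensions.
    span.set_attribute("app.customer.tier", "enterprise")
    span.set_attribute("app.invoice.line_items", 42)

    # A child span isolates the latency of one smaller unit of work.
    with tracer.start_as_current_span("apply-tax-rules"):
        ...  # tax calculation happens here
```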

“Finally,” concludes Liz, “observability data is great for building and layering on high-level constructs: measuring end-user performance, creating service-level objectives based off your traces, and using a service map to find single points of failure and dependency cycles.”
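To sketch that SLO idea in code form (again our example, not Liz’s): given root-span durations pulled from trace data, one can compute a latency SLI and compare it to a target. The 300 ms threshold and 99.9% target below are invented for illustration.

```python
# A hypothetical sketch: derive a latency SLI from trace data and check
# it against an SLO target. The threshold and target are invented.
def latency_sli(durations_ms: list[float], threshold_ms: float = 300.0) -> float:
    """Fraction of requests that completed within the latency threshold."""
    good = sum(1 for d in durations_ms if d <= threshold_ms)
    return good / len(durations_ms)

durations = [120.0, 95.0, 410.0, 88.0, 240.0]  # e.g. root-span durations in ms
sli = latency_sli(durations)
print(f"SLI: {sli:.3%}  SLO (99.9%) met: {sli >= 0.999}")
```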

Wrapping up

The best engineering teams enable developer self-service in order to ship quality, error-free code faster. At its foundation, this depends on developers being able to debug code quickly, which is not always simple in complex microservice architectures. Observability is key to helping developers understand not only what went wrong, via specific data points, but also what specific actions occurred before those data points were generated.

Thanks once again to Liz Fong-Jones for hosting this meetup. If you missed the webinar the first time around, make sure to watch the full recording here.

In case you want to dive deeper into observability, OpenTelemetry, and SLOs, here are some more useful resources:

OpenTelemetry

CNCF Slack #opentelemetry channel

OpenTelemetry.io

github.com/open-telemetry

OpenSLO

OpenSLO Slack

OpenSLO.com