Platform engineering

How SurveyMonkey built a self-service microservices platform

Renato Mefi
Head of Platform @ Gorgias

What happens when you need to build an enterprise-grade customer experience tool that handles almost half a billion requests daily while unifying multiple products? If you're Renato Mefi, former Head of Platform at SurveyMonkey, you just keep your eye on usability and forward expansion. You also build a custom self-service microservices platform that supports integrations, enriches data en route, and leverages ML. Simple.

...Of course, if you think that sounds easier said than done – and who wouldn't!? – you're in luck: In this webinar, Renato explains how all of the pieces come together with a strong focus on fundamental design principles. Here are some of his pointers on transitioning your software towards a more flexible, independent microservices solution.

Getting away from the monolith and moving towards microservices at SurveyMonkey 

When Renato joined SurveyMonkey's GetFeedback team, the organization was beginning to move in a new direction. The company wanted to shift from its monolithic software architecture toward a developer-and-microservices-friendly alternative.

There were problems from the start. The first microservice the team extracted from the monolith and got running "on its own" wasn't 100 percent independent – it still used a "SELECT *" operation to pull data from the monolith! As Renato and his colleagues quickly realized, polling the database was a bad strategy that:

  • Carried an inherent risk of omitting certain types of data changes, such as when a deletion follows an insertion,
  • Relied on a high-latency process that became even slower when data scraping was involved, and
  • Bloated source data with extra metadata, such as "last modified" database columns. 

As problems often do, these challenges joined forces to create a monster of sorts. Their combined lag alone made development noticeably harder – the team couldn't reliably route requests between old and new services for testing.
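To make the pitfall concrete, here's a minimal sketch (in Python, with a hypothetical `surveys` table and `last_modified` column; Renato didn't share the actual code) of the kind of polling loop the first extraction relied on:

```python
import sqlite3
import time

POLL_INTERVAL_SECONDS = 30  # every poll adds latency before a change is noticed


def poll_for_changes(conn: sqlite3.Connection, last_seen: str) -> list:
    """Hypothetical polling query against the monolith's database.

    It leans on an extra 'last_modified' column bolted onto the source data,
    and a row that is inserted and then deleted between two polls never shows
    up in the result set at all: that change is silently lost.
    """
    cursor = conn.execute(
        "SELECT * FROM surveys WHERE last_modified > ? ORDER BY last_modified",
        (last_seen,),
    )
    return cursor.fetchall()


def run_poller(conn: sqlite3.Connection) -> None:
    last_seen = "1970-01-01T00:00:00"
    while True:
        for row in poll_for_changes(conn, last_seen):
            last_seen = max(last_seen, row[-1])  # assumes the last column is last_modified
        time.sleep(POLL_INTERVAL_SECONDS)
```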

Learning from example

The way around these hurdles may come as a surprise: The team decided to delve into the database transaction logs for insight. If they could figure out what worked, then they could try to create something that performed the same tasks, but better. 

For starters, the devs adopted a more efficient change data capture technique. Here, their choice of tooling helped, but as Renato points out, this pattern doesn't require special software because most databases already maintain some form of transaction log that delivers fast, high-quality feedback on changes. Importantly, the transaction log paints a complete, ordered picture of changes and supports idempotency – the idea that applying an operation multiple times produces the same result as applying it once – by default, unlike queries against the database's current state.

After identifying what required monitoring, the devs wired up a system that pushed the relevant changes to Kafka. This is where Renato stresses the importance of building a diligent platform team that truly knows what it's doing.
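Renato didn't go into implementation specifics, but a minimal sketch of the idea, assuming a Debezium-style change event envelope and the kafka-python client (topic names and fields here are illustrative), might look like this:

```python
import json

from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def publish_change(table: str, operation: str, before: dict | None, after: dict | None) -> None:
    """Publish one change-data-capture event read from the transaction log.

    The envelope carries the old and new row images plus the operation type
    (insert/update/delete), so consumers see deletions too, which the old
    polling approach could miss.
    """
    event = {"table": table, "op": operation, "before": before, "after": after}
    # One topic per captured table is a common (here hypothetical) convention.
    producer.send(f"monolith.cdc.{table}", value=event)


# Example: a delete shows up as an explicit event instead of a silently missing row.
publish_change("surveys", "delete", before={"id": 42, "title": "Churn survey"}, after=None)
producer.flush()
```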

Many of the new microservices depended on the ability to extract data from the old monolith. To meet that need, the team also built a central message broker service. Upon receiving a request, the message broker provisions everything the new microservices need to have their requests satisfied, including runtimes, infrastructure, and monitoring endpoints. Each service specifies the tables it wants to capture, making interaction smoother.
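The webinar didn't show the exact configuration format, but conceptually each service hands the broker service a small declaration along these lines (a hypothetical Python sketch; the real platform may well have used YAML or another format):

```python
from dataclasses import dataclass, field


@dataclass
class CaptureConfig:
    """Hypothetical declaration a service hands to the central broker service."""

    service_name: str
    tables: list[str] = field(default_factory=list)  # monolith tables to capture
    topic_prefix: str = "monolith.cdc"               # where the change events land


# The (fictional) feedback service only asks for the tables it actually needs.
feedback_service = CaptureConfig(
    service_name="feedback-api",
    tables=["surveys", "responses"],
)
```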

No project plays out perfectly, and Renato accordingly points out that not every decision his team made was the best. For instance, he says things might have worked out better if the team had kept its configurations closer to the microservices that needed them. If he had to do it all over, he admits, he'd probably define these configs as Kubernetes custom resource definitions that lived with their respective clusters.

Although the real implementation wasn't perfect, it was head and shoulders above the old polling solution: it allowed for database record extraction in real time while revealing far more detail.

Thinking in patterns to make effective use of messaging 

This migration was a case study in organized software design. The team applied Martin Fowler's classic Strangler Fig Pattern, gradually building a system adjacent to the old one until the new one supplanted its predecessor. 

It's always smart to keep your objectives in mind at each stage of a major migration. Here, it was important for the microservices to become the source of truth as they replaced earlier implementations. In practice, however, this presented another issue: At some point, the microservices had to start passing messages to the broker.

Renato says a simple answer might involve each service sending the relevant messages after modifying the database. Unfortunately, this solution would introduce a dual-write problem – if the database write and the message publish didn't happen in the same transaction, the two systems could disagree about what actually happened.

For instance, if the broker returned a rejection, the system would need to decide whether to roll back the database transaction that prompted the unsent message. And unless both the database and the broker supported a shared protocol like two-phase commit, the system could end up in a confusing or outright unsafe state.
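A quick sketch of the naive dual write makes the danger easy to see (hypothetical table and topic names, kafka-python assumed):

```python
import sqlite3

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")


def save_survey_naively(conn: sqlite3.Connection, survey: dict) -> None:
    """The dual-write anti-pattern: two systems, no shared transaction."""
    with conn:  # the database transaction commits when this block exits
        conn.execute(
            "INSERT INTO surveys (id, title) VALUES (?, ?)",
            (survey["id"], survey["title"]),
        )
    # If the process crashes here, or the broker rejects the message, the row
    # exists in the database but downstream services never hear about it.
    # Rolling back is no longer an option: the transaction already committed.
    producer.send("survey-events", str(survey).encode("utf-8"))
```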

Achieving event-based microservice consistency

The microservices had to be autonomous, but they also needed to remain synced. Faced with such a quandary, Renato and his team decided it was time to call in another design concept: the outbox pattern. 

Imagine that a service wants to complete a DB transaction, like writing to a table. In the same transaction, it also writes all of the details describing that change to a designated "outbox" table in the database. From there, the situation becomes an all-or-nothing proposition: either the service writes to both tables or to neither.
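A minimal sketch of that idea, with illustrative table and column names:

```python
import json
import sqlite3
import uuid


def create_survey(conn: sqlite3.Connection, survey: dict) -> None:
    """Write the business row and its outbox entry in one atomic transaction."""
    with conn:  # both INSERTs commit together, or neither does
        conn.execute(
            "INSERT INTO surveys (id, title) VALUES (?, ?)",
            (survey["id"], survey["title"]),
        )
        conn.execute(
            "INSERT INTO outbox (message_id, aggregate, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), "survey", "survey_created", json.dumps(survey)),
        )
```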

Next, the team leveraged the transactional outbox relay pattern in a really clever way: the devs used their newfound appreciation for change data capture to keep data consistent across services. By listening to the transaction logs for outbox events, the relay maintained a cursor that always pointed at the current message.

With a well-designed message format, this solution keeps duplicate messages from causing problems. Since messages adhere to a schema, the system supports multiple payload types. The outbox pattern also lends itself to disaster recovery by storing a record of inter-service messages – which can come in handy if you're using a non-durable message broker.
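Renato's relay picked outbox events out of the database transaction log via change data capture; the sketch below simplifies by reading the outbox table directly, but the cursor logic is the same: remember where you left off, publish everything after that point, and only advance once the broker has accepted the message (hypothetical schema, kafka-python assumed):

```python
import json
import sqlite3

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")


def relay_outbox(conn: sqlite3.Connection, cursor_position: int) -> int:
    """Forward unpublished outbox rows to Kafka and return the new cursor.

    The cursor always points at the last row known to be delivered, so a crash
    mid-run means some rows get re-sent (at-least-once), but never skipped.
    """
    rows = conn.execute(
        "SELECT rowid, message_id, event_type, payload FROM outbox "
        "WHERE rowid > ? ORDER BY rowid",
        (cursor_position,),
    ).fetchall()
    for rowid, message_id, event_type, payload in rows:
        envelope = json.dumps(
            {"message_id": message_id, "type": event_type, "payload": json.loads(payload)}
        )
        future = producer.send("outbox-events", key=event_type.encode(), value=envelope.encode())
        future.get(timeout=10)   # wait until the broker acknowledges the message
        cursor_position = rowid  # only advance the cursor after a successful send
    return cursor_position
```

Because delivery is at-least-once, the unique message_id in each envelope is what lets consumers recognize and skip anything they've already processed.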

Making the most of a would-be flaw

This project's push for autonomy also came with a side effect: the unavoidable data copying required to reduce the microservices' dependence on the monolith.

Renato's team chose to take advantage of the duplication. Initially, all of the services had their own Kafka consumers from which they'd pull messages to be persisted in the database. The new solution involved using an as-yet unexplored inbox pattern that included a read-only database schema for housing service data. You can think of it as each service keeping its own local copy of the data it needs, with the ability to project and shape the contents.
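A sketch of the inbox side, assuming a kafka-python consumer and a per-service schema with hypothetical table names; the message_id check is what keeps a replayed or duplicated event from being applied twice:

```python
import json
import sqlite3

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "outbox-events",
    bootstrap_servers="localhost:9092",
    group_id="feedback-api-inbox",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)


def run_inbox(conn: sqlite3.Connection) -> None:
    """Project upstream events into this service's local, read-only copy."""
    for message in consumer:
        event = message.value
        with conn:  # the dedup record and the projection update commit together
            seen = conn.execute(
                "SELECT 1 FROM inbox WHERE message_id = ?", (event["message_id"],)
            ).fetchone()
            if seen:
                continue  # a replayed or duplicated event is applied only once
            conn.execute(
                "INSERT INTO inbox (message_id) VALUES (?)", (event["message_id"],)
            )
            conn.execute(
                "INSERT OR REPLACE INTO surveys_projection (id, title) VALUES (?, ?)",
                (event["payload"]["id"], event["payload"]["title"]),
            )
```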

Fundamentally, the inbox-outbox pattern added massive value to the whole architecture, making it a BASE (Basically Available, Soft state, Eventually consistent) system that:

  • Served requests successfully even when some of the source data went missing,
  • Could be integrated into microservices via relatively straightforward configuration files,
  • Featured a soft state that could change even without input, and 
  • Exhibited eventual consistency, syncing over time in the absence of inputs.

The application itself

This high-level discussion is all well and good, but it still leaves one big question: How do you implement the tooling for a service that inhabits such an ecosystem? For instance, it's not quite ergonomic to need to call a special function to write a message to the outbox every time you want to finish a database transaction. Human error is always going to be a factor, so you need to support simplicity and ease of use.

There are various ways to solve these kinds of problems, but Renato suggests building an organized, comprehensive framework: The platform team must deliver a full experience. Software architects can't just rely on a good pattern to do the heavy lifting. They also need to provide the monkey patches, middlewares, or dependency injection features required to promote ease of use and minimize error rates.
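One of many possible ways to deliver that ergonomics is a small decorator that writes the outbox entry on the developer's behalf. This is a hypothetical sketch of the kind of framework support Renato describes, not SurveyMonkey's actual code:

```python
import functools
import json
import sqlite3
import uuid


def publishes_event(event_type: str):
    """Wrap a handler so its outbox entry is written in the same transaction.

    The handler just returns the payload it wants announced; the framework
    takes care of the outbox row, so developers can't forget it.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(conn: sqlite3.Connection, *args, **kwargs):
            with conn:  # one transaction around the handler and the outbox write
                payload = func(conn, *args, **kwargs)
                conn.execute(
                    "INSERT INTO outbox (message_id, event_type, payload) VALUES (?, ?, ?)",
                    (str(uuid.uuid4()), event_type, json.dumps(payload)),
                )
            return payload
        return wrapper
    return decorator


@publishes_event("survey_created")
def create_survey(conn: sqlite3.Connection, survey: dict) -> dict:
    conn.execute("INSERT INTO surveys (id, title) VALUES (?, ?)", (survey["id"], survey["title"]))
    return survey
```

The point isn't this particular decorator; it's that whatever shape the helper takes, the safe path should also be the path of least resistance.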

Working with containerized applications can potentially be a big help during messy migrations. Robust architectures depend on microservices being able to communicate with the underlying infrastructure. In other words, independent services demand lots of libraries, which is something containers excel at standardizing. Containerization also helps promote the correct use of configurations by giving devs a starting point for building new services.

Renato stresses the need to strike a healthy balance between flexibility and standardization. It's common for services to present a very opinionated way of doing things by default, but users must be able to pick and choose what they require. By giving their teams a framework that sets the tone for monitoring, tracing, health checks, logging, request de/serialization, and other crucial events and milestones, leaders can accelerate innovation – avoiding costly mistakes while reducing the lead time for new services.

Assembling a microservices foundation for new use cases

Renato also touched on some other patterns that emerged as the service developed. For example, to iron out some of the messaging kinks, the team built a Kafka-to-HTTP pipeline integrated within the architecture. This improvement made the data more useful by supporting new functionality, like filtering messages, adding timestamps, modifying fields, and applying machine learning to analyze content. 

The Kafka-to-HTTP pipeline exemplified how creating a service with the right capabilities can open up new fields of play. For instance, the pipeline could route and demultiplex messages. Building on that capability made it easier to deliver a notification system that supported third-party apps and CRM endpoints.
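A stripped-down sketch of what one step of such a pipeline might look like, assuming kafka-python and the requests library; the filter rule, the enrichment, and the webhook URL are all placeholders:

```python
import json
import time

import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "outbox-events",
    bootstrap_servers="localhost:9092",
    group_id="kafka-to-http-bridge",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

WEBHOOK_URL = "https://example.com/hooks/survey-events"  # placeholder endpoint


def should_forward(event: dict) -> bool:
    # Placeholder filter: only forward survey-creation events.
    return event.get("type") == "survey_created"


for message in consumer:
    event = message.value
    if not should_forward(event):
        continue
    # Enrich en route: add a processing timestamp before handing off over HTTP.
    event["forwarded_at"] = time.time()
    response = requests.post(WEBHOOK_URL, json=event, timeout=5)
    response.raise_for_status()
```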

Understanding the big picture

The final app used individual services not only to speed development but also to give devs more safety guarantees. The platform's users could rely on the fact that they were using the right pipeline configurations and message schemas even as they built new tools. Since the team provided boilerplate code, effective software design principles were the default.

In the end, Renato's example isn't the only way to migrate to microservices, but it's still worth considering. As software design strategies go, this approach incentivizes consistency and autonomy. It also eliminates many of the pain points that can make big architectural transitions so tricky.