So long, Hadoop: Moving data platforms to Kubernetes

Erik Schmiegelow
CEO @ Hivemind Technologies AG

Over the last two decades, data engineering platforms have emerged as forces to be reckoned with. Unfortunately, so has the baggage that typically comes with them.

One of the biggest problems is how to uncouple data engineering workflows from their distant origins. Projects like Hadoop are highly capable. But they were designed for an era when containerization, cloud-native, and all the associated benefits were just emerging. In other words, these implementations are still serviceable, but they're suboptimal.

Erik Schmiegelow, CEO at Hivemind, showed how to shed all that baggage. In his talk at PlatformCon 2023, he explained why Kubernetes (K8s) is ideal for data engineering and why you ought to bid Hadoop farewell.

Hadoop and Spark: Data engineering vs other forms of platform engineering

Erik started by taking us back to the early 2000s, when Google published two papers on its MapReduce programming model and the Google File System (GFS). These papers electrified the industry because they showed how Google solved data problems at scale, and they prompted the Apache Hadoop project as a practical implementation in 2006.

MapReduce deals with processing and generating big data sets on a cluster via a parallel, distributed algorithm. GFS, a distributed file system, provides dependable access to data using clusters of extremely low-cost hardware.

The MapReduce and GFS models were both cost-effective and easy to reproduce, which suited major users like Facebook and Yahoo, whose clusters swelled to tens of thousands of nodes by around 2011. But there was a glaring catch.

As Erik put it, MapReduce was simple, but it sucked. Its much-touted simplicity was its downfall: it was made for mapping data and sending it to reduce jobs, nothing more. If you wanted to achieve anything more complicated, you had to resort to chaining multiple MapReduce stages together. While frameworks like Cascading helped with coding, they did little to enhance execution.
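To make the chaining problem concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop's API; run_stage and the mapper and reducer names are hypothetical. Counting words fits in one stage, but grouping words by how often they occur already needs a second stage fed by the output of the first.

```python
from collections import defaultdict


def run_stage(records, mapper, reducer):
    """Run one MapReduce stage: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Stage 1: count how often each word occurs.
def split_words(line):
    for word in line.lower().split():
        yield word, 1


def sum_counts(word, ones):
    yield word, sum(ones)


# Stage 2: regroup the stage 1 output by count.
def key_by_count(pair):
    word, count = pair
    yield count, word


def collect_words(count, words):
    yield count, sorted(words)


lines = ["so long hadoop", "hello kubernetes", "so long"]
word_counts = run_stage(lines, split_words, sum_counts)
by_frequency = run_stage(word_counts, key_by_count, collect_words)
print(word_counts)
print(by_frequency)
```

On a real cluster, each stage would write its full output to the distributed file system before the next stage could start, which is part of why long chains were slow.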

What’s wrong with using Hadoop today?

Hadoop was great in its prime, but as an implementation of MapReduce and GFS, it inherited their weaknesses. In short, Hadoop's main components (its HDFS distributed filesystem and YARN job scheduler) build on ideas that aren't as cutting-edge as they once were.

One way to look at it is that YARN is a sort of primitive ancestor to later frameworks like K8s. Although the latest version of YARN supports Docker, Erik said few people go that route because the alternative, K8s, is so much more popular and well-supported.

…Back to the history lesson. Eventually, enough was enough. Fed up with Hadoop, a team at UC Berkeley's AMPLab decided to build something with increased reliability and abstraction: Spark.

Unlike other parallel processing systems, Spark lets users describe jobs and computation tasks as directed acyclic graphs (DAGs). It also catered specifically to scientific users instead of engineers by including high-level APIs and bindings for other languages popular among data scientists.
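The idea of describing a job as a graph of dependent steps can be sketched with Python's standard library (graphlib, Python 3.9+). This is an illustration of the concept, not Spark's scheduler; the step names are hypothetical. Steps whose dependencies are all finished form a "wave" that could run in parallel.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical names).
job = {
    "read_events": set(),
    "read_users": set(),
    "filter_events": {"read_events"},
    "join": {"filter_events", "read_users"},
    "aggregate": {"join"},
    "write_report": {"aggregate"},
}


def execution_waves(dag):
    """Return lists of steps; every step in a list can run concurrently."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(job)
for number, steps in enumerate(waves, start=1):
    print(number, steps)
```

Because the graph has no cycles, an execution order always exists, and independent branches (the two reads here) need no manual coordination.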

Spark commonly runs on Hadoop, but it was designed to support running elsewhere like on Kubernetes and Mesos. This wasn’t the only reason for its popularity, however.

One of Spark's major strengths was its compatibility with different data sources and formats. You could plug in files from Google Cloud Storage buckets, S3, and HDFS, or use streams like Kafka or even Twitter feeds.

You could also write results back to data stores like Cassandra and Elasticsearch. Ordinary relational databases weren't out of the question either, so Spark also picked up steam in standard enterprise computing and application development.

Spark vs microservices

Want a high-level understanding of what Spark does and how it differs from microservices? Erik advised us to think of Spark as a freight train and microservices as a high-speed passenger train. Spark moves large amounts of data slowly but effectively, taking diverse payloads from one point to another. Microservices focus on doing a specific job quickly and rarely carry massive payloads.

Another big differentiator is that Spark can dynamically scale up or down to match the workloads you want to process. Unlike most microservices frameworks, it can plan and allocate resources on a per-job basis using a context model that plans out task execution and division of labor via a driver program. It then uses the underlying cluster's management system to furnish resources as needed.

Some microservices contain elements of this model, but Spark comes with it all baked in. Coordination is automatic and dynamic, and the system knows how to dissect big tasks into independent units that run on worker nodes (Executors in Spark lingo). Spark can also watch outcomes and restart workloads as needed, eliminating manual planning and coding.
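The driver and executor division of labor can be imitated in a few lines of plain Python. This is a toy sketch, not Spark's API: threads stand in for executors, and driver, split_into_partitions, run_with_retries, and the deliberately flaky task are all hypothetical names. It shows the three ideas from the text: splitting work into independent units, running them in parallel, and restarting a unit that fails.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Divide the input into roughly equal, independent chunks."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_with_retries(task, partition, max_attempts=3):
    """Re-run a failed task, giving up after max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(partition)
        except RuntimeError:
            if attempt == max_attempts:
                raise


def driver(data, task, num_executors=4):
    """Plan the work, hand partitions to workers, and combine the results."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partials = list(pool.map(lambda p: run_with_retries(task, p), partitions))
    return sum(partials)


failed_once = set()


def flaky_sum_of_squares(partition):
    """Fail on the first attempt for some partitions to simulate a lost worker."""
    key = tuple(partition)
    if key and key[0] % 2 == 0 and key not in failed_once:
        failed_once.add(key)
        raise RuntimeError("simulated executor loss")
    return sum(x * x for x in partition)


total = driver(list(range(10)), flaky_sum_of_squares)
print(total)
```

The caller never sees the simulated failures; the retry logic lives in the coordination layer, which is the part Spark ships ready-made.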

Exploring the Spark runtime — and bumping up against Hadoop

Although Spark improved on the early concepts that underlie Hadoop, looking at the runtime reveals an interesting connection: about 60% of Spark deployments still run on Hadoop! Most tie into cloud infrastructures via provider-specific Hadoop distributions for platforms like AWS, GCP, etc.

These Hadoop distributions are meant to hide the Hadoop hassle from users. But they can't erase the costs. You'll typically spend 20% to 25% more on Hadoop-ready managed instances than you would on instances for other cluster technology. On the other hand, that premium buys convenience: these instances come integrated with each platform's own object store, which makes managed services easier to run.

Running Spark in these infrastructures is a reliable way to accomplish big data tasks using mature implementations. In addition to there being plenty of learning resources and integrated tooling to help, you can take advantage of built-in features geared toward simplified deployment and batch management. You also get excellent managed service support from most major public cloud providers.

There are cons, however. Instead of containers, you're using VMs, and ramping up clusters can be slow and costly. Spark workloads are also inherently siloed from the outset and don't integrate well with broader application landscapes. When it comes to versioning, they're tied to the cluster environment you're running.

The K8s alternative

Luckily, you're not stuck with Hadoop. As Erik reminded us, Spark started as an application framework that supported multiple cluster managers. Today K8s support is generally available, and there are even two handy run modes to choose from:

Spark-submit

This mode is the traditional, no-fuss standard. It uses Spark's tooling to run a script that packages the application and stages the job for you. Finally, it runs the workload on the cluster using the driver to plug in all the bits and pieces along the way.

If there's a disadvantage to this, it's simplicity. Spark-submit works out of the box, but that comes at the expense of fine-grained control: you mostly run with the cluster defaults.
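A submission against a K8s cluster might look roughly like the command assembled below. The sketch only builds the argument list and prints it; it does not contact a cluster. The flags and the two spark.* properties are documented Spark options, while the API server URL, image name, and jar path are placeholders, and build_spark_submit is a hypothetical helper.

```python
import shlex


def build_spark_submit(api_server, image, app_file, main_class, name, executors=2):
    """Assemble a spark-submit argument list for a K8s cluster-mode run."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_file,
    ]


command = build_spark_submit(
    api_server="https://k8s.example.com:6443",  # placeholder API server
    image="registry.example.com/spark:latest",  # placeholder image
    app_file="local:///opt/spark/examples/jars/spark-examples.jar",  # placeholder path
    main_class="org.apache.spark.examples.SparkPi",
    name="spark-pi",
)
print(shlex.join(command))
```

The local:// scheme tells Spark that the jar is already inside the container image, so the image has to be built with the application in it.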

Spark Operator

This mode is a bit more involved but way more flexible. It uses custom K8s resources and lets you adopt a declarative approach to application specs and management. You get to configure the instance and node types. You can even tweak volumes, config maps, and profiles to your heart’s content.

Unlike Spark-submit, you don't have to settle for the cluster defaults. This flexibility is a huge win for security and versioning. You can tie down security details and customize the version you're running against as long as you provision an appropriate image for the operator.
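With the operator, the same kind of job becomes a declarative resource. The sketch below builds a SparkApplication manifest as a Python dict and prints it as JSON, which kubectl accepts alongside YAML. The field names follow my reading of the operator's v1beta2 custom resource and should be checked against the operator version you install; the image, file path, service account name, and the spark_application helper are assumptions for illustration.

```python
import json


def spark_application(name, image, main_file, spark_version,
                      executors=2, namespace="default"):
    """Build a SparkApplication manifest (field names assumed from v1beta2)."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,  # the image pins the Spark version per application
            "sparkVersion": spark_version,
            "mainApplicationFile": main_file,
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"cores": 1, "instances": executors, "memory": "2g"},
        },
    }


manifest = spark_application(
    name="daily-report",  # hypothetical job name
    image="registry.example.com/spark-py:latest",  # placeholder image
    main_file="local:///opt/app/report.py",  # placeholder path
    spark_version="3.5.0",
)
print(json.dumps(manifest, indent=2))
```

Because the spec is just data, it can live in version control next to the application code and go through the same review and deployment pipeline as any other K8s resource.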

Why run Spark on K8s?

You’re probably already partially convinced that Spark on K8s is the better option, but Erik drove the point home with some benefits we haven’t covered yet:

  • You can run your Spark apps and microservices in a uniform ecosystem. This is hugely advantageous for platform engineering teams that can't maintain multiple ecosystems.
  • You can use containers to test your Spark applications. Working with containers was way tougher on Hadoop. K8s eases the process of running different Spark versions and customizing security contexts. Even better, you can do so at the application level instead of per cluster.
  • Your apps become more portable across public clouds and environments. With Hadoop, you can't quite port across different clouds due to incompatibilities.
  • Running Spark takes significantly less money on K8s. For example, you'll often save around 20% of the cost of a comparable AWS deployment. Cluster deployment is also faster.
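The savings figure lines up with the 20% to 25% premium quoted earlier for Hadoop-ready managed instances. As a back-of-the-envelope check, and assuming the premium is the only cost difference (a simplification, since real bills also depend on startup times and utilization), dropping a surcharge of p over the baseline saves p / (1 + p) of the total:

```python
def savings_from_premium(premium):
    """Fraction of the bill saved by dropping a surcharge over the baseline."""
    return premium / (1 + premium)


for premium in (0.20, 0.25):
    saved = savings_from_premium(premium)
    print(f"{premium:.0%} premium -> {saved:.1%} saved")
# prints:
# 20% premium -> 16.7% saved
# 25% premium -> 20.0% saved
```

So a 25% premium corresponds to a 20% saving, which is in the range the talk cites.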

Migrating to Spark on K8s: Is it worth it?

Erik noted that running Spark on K8s still needs work to become an ideal experience. One of the big remaining problems involves availability: Not everything is ready out of the box, so you'll likely need to configure certain artifacts and tools.

Dynamic allocation may pose another potential challenge. The job planning and resource allocation frameworks that helped Spark succeed on Hadoop aren't quite as mature in the K8s world. As a result, dynamic allocation isn't as easy to use, and it can also eat up a significant chunk of your cluster resources.
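For orientation, these are the Spark properties usually involved when enabling dynamic allocation on K8s. The property names are documented Spark settings; the helper functions and the chosen limits are hypothetical. Shuffle tracking is included on the assumption that the cluster has no external shuffle service, and the executor cap is one way to keep a single job from claiming too much of the cluster.

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Return Spark properties that enable bounded dynamic allocation."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Assumed necessary when no external shuffle service is available.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        # Upper bound so one job cannot absorb the whole cluster.
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def as_submit_flags(conf):
    """Render properties as --conf arguments for spark-submit."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


conf = dynamic_allocation_conf(min_executors=1, max_executors=8)
print(" ".join(as_submit_flags(conf)))
```

The right bounds depend on the workload and on what else shares the cluster, so treat the numbers here as placeholders.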

This disparity essentially boils down to a matter of time. Spark on Hadoop has about a 10-year lead over the K8s version.

Still, Erik said this is the time to forge ahead. Despite the minor hiccups, the benefits of using K8s far outshine the old Hadoop way. In other words, if you haven't migrated all your data applications to K8s, you ought to start.

The same goes for checking out the Platform Engineering YouTube channel. PlatformCon 2023 was packed with learning to help you become a better developer and platform engineer.

Case in point, the webinar for this talk included an insightful Q&A session that went deeper into the origins of Spark and Hadoop. Erik also disclosed more of the realities of using both options and gave a few situation-specific tips on migrating to K8s. Don't miss it if you're planning on transitioning!