Amazon’s Exabyte-Scale Migration from Apache Spark to Ray on Amazon EC2

Large-scale, distributed compute framework migrations are not for the faint of heart. There are backwards-compatibility constraints to maintain, performance expectations to meet, scalability limits to overcome, and the omnipresent risk of introducing breaking changes to production. All of this becomes especially daunting when you happen to be migrating away from something that successfully processes exabytes of data daily, delivers critical business insights, has tens of thousands of customers who depend on it, and is expected to have near-zero downtime.

But that’s exactly what the Business Data Technologies (BDT) team at Amazon Retail is doing right now. They just flipped the switch to quietly begin moving management of some of their largest production business intelligence (BI) datasets from Apache Spark over to Ray, helping to reduce both data processing time and cost. They’ve also contributed a key component of their work (the Flash Compactor) back to Ray’s open source DeltaCAT project. This contribution is a first step toward letting other users realize similar benefits when using Ray on Amazon Elastic Compute Cloud (Amazon EC2) to manage open data catalogs like Apache Iceberg, Apache Hudi, and Delta Lake.

So, what convinced them to take this risk? Furthermore, what inspired them to choose Ray – an open source framework known more for machine learning (ML) than big data processing – as the successor to Spark for this workload? To better understand the path that led them here, let’s go back to 2016.
