Benchmarking in the Real World: Apache Spark™ 3.5.0 on EC2


In our conversations with clients, we've found that Spark is a go-to tool for many data engineering and analytics tasks. It's battle-tested, robust, and capable of handling virtually any amount of data. However, a common challenge we hear about is the difficulty in managing an organization's Spark deployment. Which clusters should be used? How many clusters are needed? Should you use EMR or EMR Serverless?

The answer often depends on the specific characteristics of your organization's data and workloads. Still, there are some general guidelines that can improve your total cost of ownership (TCO) without resorting to "it depends." At Underspend, we've benchmarked various Spark programs on different EC2 instance types and found that the cost of running a given program can differ by more than 100% depending on the instances chosen. The key takeaway is that it's worth investing time in choosing the instances your Spark workloads run on, because the cost difference can be significant.
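To make that comparison concrete, here is a minimal sketch of how per-run cost can be measured: run a representative job on a candidate cluster, record the wall-clock time, and multiply by that cluster's hourly price. This is not our benchmark harness or the client workload; the S3 paths, column names, and the CLUSTER_HOURLY_USD variable are illustrative assumptions you would replace with your own job and EC2/EMR pricing.

import os
import time

from pyspark.sql import SparkSession, functions as F


def run_job(spark: SparkSession) -> None:
    # Stand-in for the real workload: a scan plus aggregation over Parquet.
    # The bucket, columns, and output path are placeholders.
    (spark.read.parquet("s3://your-bucket/events/")
         .groupBy("event_type")
         .agg(F.count("*").alias("n"),
              F.avg("duration_ms").alias("avg_duration_ms"))
         .write.mode("overwrite")
         .parquet("s3://your-bucket/benchmark-output/"))


if __name__ == "__main__":
    spark = SparkSession.builder.appName("instance-benchmark").getOrCreate()

    start = time.monotonic()
    run_job(spark)
    elapsed_hours = (time.monotonic() - start) / 3600

    # Hourly price of the cluster this run executed on, supplied at submit
    # time (e.g. CLUSTER_HOURLY_USD=3.84 for ten m5.2xlarge nodes -- an
    # illustrative figure, not a quoted price).
    hourly_usd = float(os.environ.get("CLUSTER_HOURLY_USD", "0"))
    print(f"runtime: {elapsed_hours * 3600:.0f}s, "
          f"estimated cost: ${elapsed_hours * hourly_usd:.2f} per run")

    spark.stop()

Repeating the same run on each candidate instance family and comparing the printed per-run cost is enough to surface the kind of 100%+ gap described above.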

Our results are based on a real-world PySpark program provided by one of our clients. The program uses no UDFs and translates into clean Spark code, running on Spark 3.5.0. While the TPC-DS benchmark is widely used and certainly has its place, we've found that this program is more representative of what companies actually run in production. Here are our findings:
