Apache DataFusion is now the fastest single-node engine for querying Apache Parquet files

submitted by
Style Pass
2024-11-23 12:00:02

I am extremely excited to announce that Apache DataFusion is the fastest engine for querying Apache Parquet files in ClickBench. It is faster than DuckDB, chDB, and ClickHouse on the same hardware. It also marks the first time a Rust-based engine holds the top spot, which has previously been held by traditional C/C++-based engines.

Figure 1: 2024-11-16 ClickBench Results for the ‘hot’[1] run against the partitioned 14 GB Parquet dataset (100 files, each ~140 MB) on a c6a.4xlarge (16 CPU / 32 GB RAM) VM. Measurements are relative (1.x) to results using different hardware.

Best-in-class performance on Parquet is now available to anyone. DataFusion’s open design lets you start quickly with a full-featured query engine, including SQL, data formats, catalogs, and more, and then customize any behavior you need. I predict the continued emergence of new classes of data systems now that creators can focus the bulk of their innovation on areas such as query languages, system integrations, and data formats rather than trying to play catch-up with core engine performance.

ClickBench also includes results for proprietary storage formats, which require costly load / export steps. That makes them useful in fewer use cases and thus much less important than open formats (though the idea of use-case-specific formats is interesting[2]).
