Chapter II: From Parquet to Arrow

In 2015, a discussion started in the Parquet community around the need for an in-memory columnar format. The goals were to enable vectorized query engines and interoperable data exchange. The requirements were different enough from Parquet's to warrant the creation of a separate format, one focused on in-memory processing.

As we started clarifying exactly what it meant to create an in-memory columnar format, it became clear to me that it looked awfully similar to the in-memory representation that the Apache Drill project had built for its own purposes, so I brought them into the discussion.

I had spent some time with the creators of Drill, discussing how best to read from Parquet into their in-memory representation and facilitating Parquet adoption within their project. As I found myself explaining Dremel’s definition and repetition levels to the Drill team and others, I figured I might as well do it in writing on the Twitter Engineering blog. I have reproduced the “Dremel made simple with Parquet” blog post here.

The need for an in-memory columnar format comes from vectorization in databases. As computer architecture evolved, query execution needed to adapt. MonetDB is a well-known research database whose work was seminal to vectorization research. Two main characteristics of CPUs lead to this approach.