Simple, Efficient, and Robust Hash Tables for Join Processing

submited by
Style Pass
2024-06-05 18:30:12

Hash tables are probably the most versatile data structures for data processing. For that reason, CedarDB depends on hash table to perform some of the most crucial parts of its query execution engine. Most prominently, CedarDB implements relational joins as hash joins. This blog post assumes you know what a hash join is. If not, the Wikipedia article has a short introduction into the topic for you. During the development of Umbra and now CedarDB, we rewrote our join hash table implementation several times. To share our latest design, TUM and CedarDB published a peer-reviewed scientific paper, which Altan will present at DaMoN'24 in Santiago de Chile next week.

While the paper is an excellent read by itself, this blog post will offer a more approachable description to the fundamental ideas behind our join implementation. In the following, we’ll:

Since hash tables are such a common data structure, there are already many high-quality hash table implementations. For general-purpose C++ implementations, abseil, ankerl, boost, and folly, are popular choices. Unfortunately, none of them are a good fit for processing joins in database systems, where we have a set of rather unique and demanding requirements:

Leave a Comment