Apache Hudi is one of the leading three table formats (Apache Iceberg and Delta Lake being the other two). Whereas Apache Iceberg internals are relati

Understanding Apache Hudi's Consistency Model Part 1

submited by

Style Pass

2024-04-24 16:30:09

Apache Hudi is one of the leading three table formats (Apache Iceberg and Delta Lake being the other two). Whereas Apache Iceberg internals are relatively easy to understand, I found that Apache Hudi was more complex and hard to reason about. As a distributed systems engineer, I wanted to understand it and I was especially interested to understand its consistency model with regard to multiple concurrent writers. Ultimately, I wrote a TLA+ specification to help me nail down the design and understand its consistency model.

Hudi being more complex doesn’t mean Iceberg is better, only that it takes a little more work to internalize the design. One key reason for the complexity is that Hudi incorporates more features into the core spec. Where Iceberg is just a table format for now, Hudi is a fully blown managed table format with multiple query types. If you are well-versed in Delta Lake internals, you will also see that the Hudi design has a number of similarities to that of Delta Lake.

This analysis does not discuss performance at all, nor does it discuss how Hudi supports different use cases such as batch and streaming. It is solely focused on the consistency model of Hudi with special emphasis on multi-writer scenarios. It is also currently limited to Copy-On-Write (COW) tables. I’m starting with COW because it is a little simpler than Merge-On-Read and therefore a better place to start an analysis.