Building Streaming Data Lakes with Hudi and MinIO

Submitted by Style Pass, 2022-09-21 19:00:20

Apache Hudi is a streaming data lake platform that brings core warehouse and database functionality directly to the data lake. Not content to call itself an open file format like Delta or Apache Iceberg, Hudi provides tables, transactions, upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency.
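To make "upserts/deletes" concrete, here is a toy in-memory model of the semantics. It is a sketch only, not Hudi's API or implementation: the `upsert` and `delete` functions and the `id` and `ts` field names are hypothetical, and the rule that the record with the larger ordering value wins is a simplification of how a table format might reconcile two versions of the same key.

```python
from typing import Dict, Iterable

Record = Dict[str, object]


def upsert(table: Dict[object, Record], batch: Iterable[Record],
           key: str = "id", ordering: str = "ts") -> Dict[object, Record]:
    """Return a new table with each batch record inserted or updated by key.

    When a key already exists, the record with the larger (or equal)
    ordering value wins, so late-arriving stale updates are ignored.
    """
    merged = dict(table)
    for rec in batch:
        current = merged.get(rec[key])
        if current is None or rec[ordering] >= current[ordering]:
            merged[rec[key]] = rec
    return merged


def delete(table: Dict[object, Record], keys: Iterable[object]) -> Dict[object, Record]:
    """Return a new table without the given keys."""
    doomed = set(keys)
    return {k: v for k, v in table.items() if k not in doomed}


# Hypothetical usage: two batches of ride records arrive over time.
table = upsert({}, [{"id": 1, "ts": 10, "fare": 5.0},
                    {"id": 2, "ts": 10, "fare": 7.5}])
table = upsert(table, [{"id": 1, "ts": 20, "fare": 6.0},   # newer: replaces id 1
                       {"id": 2, "ts": 5, "fare": 0.0}])   # older: ignored
table = delete(table, [2])
```

A real table format also has to persist these changes atomically and keep an index from keys to files, which is what the transactions and indexes listed above provide.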

Introduced in 2016, Hudi is firmly rooted in the Hadoop ecosystem, which explains the name: Hadoop Upserts anD Incrementals. It was developed to manage the storage of large analytical datasets on HDFS. Hudi’s primary purpose is to decrease latency during ingestion of streaming data.

Over time, Hudi has evolved to use cloud storage and object storage, including MinIO. Hudi’s move away from HDFS mirrors the broader industry shift from legacy HDFS to performant, scalable, cloud-native object storage. Hudi’s promise of optimizations that make analytic workloads faster for Apache Spark, Flink, Presto, Trino, and others dovetails nicely with MinIO’s promise of cloud-native application performance at scale.

Companies using Hudi in production include Uber, Amazon, ByteDance, and Robinhood, which run some of the largest streaming data lakes in the world. The key to Hudi at this scale is that it provides an incremental data processing stack that conducts low-latency processing on columnar data. Typically, systems write data out once using an open file format like Apache Parquet or ORC, and store it on top of highly scalable object storage or a distributed file system. Hudi serves as a data plane to ingest, transform, and manage this data. Hudi interacts with storage using the Hadoop FileSystem API, which is compatible with (but not necessarily optimal for) implementations ranging from HDFS to object storage to in-memory file systems.
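Because Hudi reaches storage through the Hadoop FileSystem API, pointing it at MinIO is largely a matter of configuring the S3A connector. The sketch below only builds the option dictionaries a Spark job might use; it does not start Spark. The helper names, endpoint, credentials, bucket, and field names are all assumptions for illustration, while the `fs.s3a.*` and `hoodie.*` keys are standard Hadoop S3A and Hudi datasource settings.

```python
from typing import Dict


def minio_s3a_conf(endpoint: str, access_key: str, secret_key: str,
                   use_ssl: bool = False) -> Dict[str, str]:
    """Spark settings that route s3a:// paths to a MinIO endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO deployments are commonly addressed as host/bucket rather
        # than bucket.host, so path-style access is enabled here.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def hudi_upsert_options(table_name: str, record_key: str,
                        precombine: str, partition_path: str) -> Dict[str, str]:
    """Hudi datasource options for an upsert write."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition_path,
    }


# Hypothetical values; replace with your own deployment details.
spark_conf = minio_s3a_conf("http://localhost:9000", "minioadmin", "minioadmin")
write_opts = hudi_upsert_options("trips", "uuid", "ts", "city")

# With a SparkSession built from spark_conf and a DataFrame df, the write
# would look roughly like:
#   df.write.format("hudi").options(**write_opts) \
#       .mode("append").save("s3a://my-bucket/trips")
```

Keeping the storage settings separate from the table settings reflects the layering described above: the S3A keys concern only the FileSystem implementation, so the same Hudi options would apply unchanged on HDFS.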
