The Global Data Warehouse team at Uber democratizes data for all of Uber with a unified, petabyte-scale, centrally modeled data lake. The data lake consists of foundational fact, dimension, and aggregate tables developed using dimensional data modeling techniques that can be accessed by engineers and data scientists in a self-serve manner to power data engineering, data science, machine learning, and reporting across Uber. The ETL (extract, transform, load) pipelines that compute these tables are thus mission-critical to Uber’s apps and services, powering core platform features like rider safety, ETA predictions, fraud detection, and more. At Uber, data freshness is a key business requirement. Uber invests heavily in engineering efforts that process data as quickly as possible to keep it up to date with the happenings in the physical world.
In order to achieve such data freshness in our ETL pipelines, a key challenge is incrementally updating these modeled tables rather than recomputing all the data with each new ETL run. This is also necessary to operate these pipelines cost-effectively at Uber’s enormous scale. In fact, as early as 2016, Uber introduced a new “transactional data lake” paradigm with powerful incremental data processing capabilities through the Apache Hudi project to address these challenges. We later donated the project to the Apache Software Foundation. Apache Hudi is now a top-level Apache project used industry wide in a new emerging technology category called the lakehouse. During this time, we are excited to see that the industry has largely moved away from bulk data ingestion towards a more incremental ingestion model that Apache Hudi ushered in at Uber. In this blog, we share our work over the past year or so in extending this incremental data processing model to our complex ETL pipelines to unlock true end-to-end incremental data processing.