
Why you should monitor data at the object storage layer


A major goal of data quality monitoring is to ensure that data consumers such as BI dashboards, ML training jobs, and data APIs have a sufficient guarantee of the quality and integrity of the backing data assets.

A lot of attention has been given to observing data quality at the database and warehouse layers. This makes sense: endpoints consume directly from a database or warehouse, and transform tasks such as dbt workflows both read from and write to them.

Nevertheless, we have seen many cases where detecting issues after data has been materialised into a datastore is a sub-optimal workflow:

Compute cycles are wasted on loading and transforming data before issues are identified, and the affected subsets of data then have to be re-processed

Increasingly popular formats like Parquet, Iceberg and Delta Lake allow analysts to bypass the warehouse and query directly on object storage (see the sketch after this list)

Non-analytics data, such as some types of training data, may never make it into a database or warehouse but is still used for business-critical applications
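
As a minimal sketch of what a check at the object storage layer could look like, the snippet below uses pyarrow to validate a batch of Parquet files in place, before any load or transform job spends compute on it. The bucket path, column names, and specific checks are hypothetical placeholders rather than a prescription for any particular tool.

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Hypothetical bucket, prefix, and columns -- substitute your own layout.
dataset = ds.dataset("s3://example-bucket/landing/orders/", format="parquet")

# 1. Schema check: catch drift before any warehouse load runs.
expected_columns = {"order_id", "customer_id", "amount", "created_at"}
missing = expected_columns - set(dataset.schema.names)
if missing:
    raise ValueError(f"Schema drift in landing zone, missing columns: {missing}")

# 2. Lightweight content checks: scan only the columns needed, straight from
#    object storage, without materialising anything into a database/warehouse.
table = dataset.to_table(columns=["order_id", "amount"])
if table.num_rows == 0:
    raise ValueError("No rows found for this batch")

null_ids = pc.sum(pc.is_null(table["order_id"]).cast("int64")).as_py()
if null_ids:
    raise ValueError(f"{null_ids} rows have a null order_id")

negative_amounts = pc.sum(pc.less(table["amount"], 0).cast("int64")).as_py()
if negative_amounts:
    raise ValueError(f"{negative_amounts} rows have a negative amount")
```

A check like this can run as soon as files land in the bucket, so downstream loads, dbt runs, and analysts querying the files directly all see data that has already passed the same gate.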
