This article introduces a taxonomy for data transformations in AI applications that is fundamental for any AI system that wants to reuse feature data

A Taxonomy for Data Transformations in AI Systems

submited by
Style Pass
2024-07-04 08:00:02

This article introduces a taxonomy for data transformations in AI applications that is fundamental for any AI system that wants to reuse feature data in more than one model. The taxonomy consists of model-independent data transformations that produce reusable features (for example, total customer spend in the last week), model-dependent transformations that produce features that are specific to one model (for example, normalizing that customer spend value using the mean and standard deviation of all customer spends in the model’s training dataset), and on-demand transformations, found in real-time AI systems, that require data only available at request-time (for example, transforming longitude/latitude to a zipcode). The data transformation taxonomy shows you what data transformations to apply in which AI pipelines, and potential sources of offline-online skew. We finish the article comparing how existing feature stores fare in addressing the taxonomy, and the event-sourcing pattern requires full support for the taxonomy.

This article is part 2 in a 7 part series describing in lay terms concepts and results from a SIGMOD 2024 research paper on the Hopsworks Feature Store.

Leave a Comment