This blog post shows the results of a study that investigated how to optimize point-in-time joins in Apache Spark. The findings of this study show tha

A Spark Join Operator for Point-in-Time Correct Joins

submited by

Style Pass

2022-07-01 18:30:11

This blog post shows the results of a study that investigated how to optimize point-in-time joins in Apache Spark. The findings of this study show that the best and most consistent results were achieved by utilizing the Early Stop Sort-Merge join. This join operator is what will come to power the point-in-time join functionality of Hopsworks in the near future.

When constructing datasets to use for machine learning models, data scientists sometimes need to look at historical data to obtain previously occurring scenarios. By looking at the previously occurring scenarios, the model is able to find correlations between the different features of the observed entity or entities. A feature is a measurable attribute of an entity whose behavior our model is trying to predict. When constructing our desired dataset, we need to construct the feature values of an entity at a specific point in time. Although, as the feature values may reside in different datasets, updated at different frequencies, it is not always possible to utilize an exact time-based join to obtain the desired result.

The illustration that can be observed in Figure 1. displays how point-in-time correctness can be achieved. The hexagons in the illustration represent feature update and target prediction events. The prediction targets (marked in dark red) specify a point in time where we want to extract the most recent feature values for some entity. It is important to not leak information from the future (light red / pink events) into the training dataset, since it will produce scenarios that never actually occurred. The only events we want to consider are the green events, as these are the most recent feature values at the time of the red event.