Recommender systems work well when we have a lot of data on user-item preferences. With a lot of data, we have high certainty about what users like; with very little data, we have low certainty. Despite that low certainty, recommenders tend to greedily promote items that received higher engagement in the past. And because recommenders influence how much exposure an item gets, potentially relevant items that aren't recommended continue to receive little to no engagement, perpetuating the feedback loop.
Bandits address this by modeling uncertainty and exploration. By acknowledging the uncertainty in the data and deliberately exploring to reduce it, bandits learn about the relevance of unexplored items.
This is especially applicable when the item set changes quickly, such as for news, ads, and tweets, or when the rate of traffic is low. If new items are constantly added, waiting to collect batch data before retraining the model can be too slow. Bandits are a good fit as they can incrementally update with new data and adaptively focus on items with higher reward. This reduces regret, the opportunity cost incurred while recommending suboptimal items.
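One standard way to formalize regret over $T$ recommendations (notation here is the usual bandit convention, not tied to any specific system):

$$\text{Regret}(T) = T\,\mu^{*} - \sum_{t=1}^{T} \mathbb{E}[r_t]$$

where $\mu^{*}$ is the expected reward of the best arm and $r_t$ is the reward received at trial $t$. The less time a policy spends on arms worse than the best one, the lower its regret.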
We’ll briefly discuss three main bandit algorithms before looking at some industrial implementations of each. Here are a few terms I use throughout: (i) action/arm: recommendation candidates, (ii) reward: customer interaction from a single trial, such as a click or purchase, (iii) value: estimated long-term reward of an arm over multiple trials, and (iv) policy: algorithm/agent that chooses actions based on learned values.
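To make these terms concrete, here's a minimal sketch of an ε-greedy policy in Python (the class and method names are illustrative, not from any particular library): each arm indexes a recommendation candidate, `update` folds in a single-trial reward, `values` holds the running estimate of each arm's value, and `select_arm` is the policy's decision rule.

```python
import random


class EpsilonGreedyPolicy:
    """Minimal epsilon-greedy policy: one value estimate per arm,
    updated incrementally after each observed reward."""

    def __init__(self, n_arms: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms     # trials per arm
        self.values = [0.0] * n_arms   # estimated value (long-term reward) per arm

    def select_arm(self) -> int:
        # Explore with probability epsilon; otherwise exploit the best-known arm.
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm: int, reward: float) -> None:
        # Incremental mean update: no batch retraining needed.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

Each call to `update` folds in a single new observation (e.g., click = 1, no click = 0), which is what lets a bandit adapt to new items without waiting for a batch retrain.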