Simon Ward-Jones | Thompson Sampling


A multi-armed bandit problem is one in which you are faced with a number of options (often called arms or variants in A/B testing). You need to decide which one to choose to maximise some reward. The problem is that you don’t know how good each option is and you need to balance between trying out new options (exploration) and choosing the best based on what you know so far (exploitation).

Thompson sampling is a strategy for balancing exploration and exploitation in a multi-armed bandit problem to maximise reward.
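The idea can be sketched in a few lines: keep a Beta posterior over each arm's unknown success rate, sample a plausible rate for every arm, play the arm with the highest sample, and update that arm's posterior with the observed reward. This is a minimal sketch assuming Bernoulli (click / no click) rewards; the `true_ctrs` values are made-up rates for three hypothetical arms, unknown to the algorithm itself.

```python
import random

random.seed(42)

# Hypothetical true click-through rates -- hidden from the algorithm.
true_ctrs = {"A": 0.04, "B": 0.06, "C": 0.05}

# Beta(1, 1) priors: alpha tracks successes + 1, beta tracks failures + 1.
alpha = {arm: 1 for arm in true_ctrs}
beta = {arm: 1 for arm in true_ctrs}
pulls = {arm: 0 for arm in true_ctrs}

for _ in range(10_000):
    # Sample a plausible CTR for each arm from its current posterior...
    sampled = {arm: random.betavariate(alpha[arm], beta[arm]) for arm in true_ctrs}
    # ...and play the arm whose sampled CTR is highest.
    arm = max(sampled, key=sampled.get)
    pulls[arm] += 1
    # Observe a Bernoulli reward and update that arm's posterior counts.
    reward = random.random() < true_ctrs[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward
```

Arms that look promising get sampled high more often (exploitation), while uncertain arms still occasionally produce a high sample and get tried (exploration), so the balance falls out of the posterior sampling itself.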

To make this more concrete, let’s imagine we are trying to improve click-through on the homepage of a website. Suppose we have developed three different options we could show the user, and we want to know which is best at getting users to click through! In this example the reward is 1 if the user clicks and 0 if they don’t.

In a classic A/B test with three options A, B and C (assuming C is the current control experience), you would randomly assign users to one of the three options and then measure the click-through rate (CTR) for each option once the experiment has run for the pre-decided duration (likely based on a power calculation).
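The classic approach can be sketched as follows. Again the `true_ctrs` values are hypothetical rates used only to simulate user behaviour; assignment is uniformly random and the CTRs are only compared once all the traffic has been collected.

```python
import random

random.seed(7)

# Hypothetical true click-through rates; C plays the role of the control.
true_ctrs = {"A": 0.04, "B": 0.06, "C": 0.05}

clicks = {arm: 0 for arm in true_ctrs}
impressions = {arm: 0 for arm in true_ctrs}

# Randomly assign each user to one of the three options for the full
# duration of the experiment, recording clicks as Bernoulli outcomes.
for _ in range(30_000):
    arm = random.choice(list(true_ctrs))
    impressions[arm] += 1
    clicks[arm] += random.random() < true_ctrs[arm]

# Only at the end of the experiment do we compare the measured CTRs.
ctr = {arm: clicks[arm] / impressions[arm] for arm in true_ctrs}
```

Note the contrast with Thompson sampling: here the assignment probabilities stay fixed for the whole experiment, so a clearly worse option keeps receiving a third of the traffic until the test ends.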
