The Explore vs. Exploit Dilemma

Suppose we have a series of decisions to make, each with the potential to yield a reward. In our multi-armed bandit problem, we aim to develop a strategy that maximizes the cumulative reward over time. We envision each “arm” as a slot machine hiding its own reward distribution, and our task is to decide which arm to pull at each time step to accumulate the most reward.
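To make the setup concrete, here is a minimal sketch of such an environment in Python (the class name, arm count, and Gaussian reward distributions are illustrative assumptions, not details from the post):

```python
import random

# Illustrative sketch of the setup above; the class name, arm count, and
# Gaussian reward distributions are assumptions, not taken from the post.
class BanditArm:
    def __init__(self, mean, stddev=1.0):
        self.mean = mean        # hidden true expected reward
        self.stddev = stddev    # hidden reward noise

    def pull(self):
        # One pull yields a single noisy reward drawn from this arm's distribution.
        return random.gauss(self.mean, self.stddev)

# Three arms with different expected rewards, unknown to the player.
arms = [BanditArm(0.2), BanditArm(0.5), BanditArm(0.8)]
reward = arms[1].pull()
```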

If we consider $t=0$ as our starting state, where we know nothing about the reward distributions, and $t=1$ as the ideal state, where we have complete knowledge of the best arm, then we can define a function that carries us from ignorance to an optimal selection. In this framework, we can imagine a vector field guiding us from exploring new arms to exploiting the most rewarding ones.
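One concrete way to picture this transition, purely as an illustrative schedule rather than a definition from the post, is to let the probability of exploring decay with the normalized time $t \in [0, 1]$:

$$
\varepsilon_t = 1 - t,
\qquad
a_t =
\begin{cases}
\text{a uniformly random arm} & \text{with probability } \varepsilon_t,\\
\text{the arm with the highest estimated reward} & \text{with probability } 1 - \varepsilon_t.
\end{cases}
$$

At $t = 0$ every pull is exploratory; as $t$ approaches $1$, the policy increasingly exploits the arm it currently believes is best.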

Our state of knowledge at time $t$ can be denoted $\phi_t(x)$, representing the expected reward we currently attribute to each arm, updated after each trial. The expected reward flow $\phi_t(x)$ then describes how these estimates evolve as pulls accumulate.
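As a rough sketch of how such per-arm estimates can be maintained, the loop below uses a standard incremental sample-average update together with the linearly decaying exploration rate sketched above, reusing the illustrative `BanditArm` arms; the function name and schedule are assumptions for illustration, not the post's definition of the flow:

```python
import random

def epsilon_greedy_bandit(arms, steps=1000):
    """Sketch of an epsilon-greedy loop: phi[a] holds the running
    sample-average reward estimate for arm a, updated after every pull."""
    phi = [0.0] * len(arms)    # expected-reward estimate per arm
    counts = [0] * len(arms)   # number of pulls per arm
    total = 0.0
    for t in range(steps):
        epsilon = 1.0 - t / steps                    # explore early, exploit late
        if random.random() < epsilon:
            a = random.randrange(len(arms))          # explore: random arm
        else:
            a = max(range(len(arms)), key=lambda i: phi[i])  # exploit: best estimate
        r = arms[a].pull()
        counts[a] += 1
        phi[a] += (r - phi[a]) / counts[a]           # incremental sample-average update
        total += r
    return phi, total

estimates, cumulative_reward = epsilon_greedy_bandit(arms)
```

With enough pulls, each estimate drifts toward its arm's hidden mean, and the best arm is chosen more and more often as exploration fades.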
