Chapter 2: Multi-armed Bandits

June 8, 2026

evaluative feedback - indicates the quality of the action taken

instructive feedback - indicates the correct action

k-armed bandit problem - repeated choice between kk actions, each producing reward from a stationary probability distribution conditioned on the chosen action

-action *value* - expected reward given that an action is selected
-q(a)q_*(a) is the expected reward of action aa
-AtA_t - action selected at time step tt, RtR_t - resulting reward
q(a)=E[RtAt=a]q_*(a) = \mathbb{E}\left[R_t \mid A_t = a \right]
-Qt(a)Q_t(a) - estimated value of action aa at time step tt
-based on these estimates, one action is highest at any point in time
-exploit current knowledge by taking the greedy action
-explore to improve estimates of non-greedy actions
-the need to balance exploration and exploitation is a distinctive challenge in reinforcement learning
-action-value methods estimate action values and make action selection decisions
-averaging received rewards is one way to estimate them
Qt(a)=i=1t1Ri1Ai=ai=1t11Ai=aQ_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}
-over enough time steps, Qt(a)Q_t(a) approaches q(a)q_*(a)

ϵ\epsilon-greedy methods - exploit the highest known value with probability 1ϵ1 - \epsilon, otherwise explore all actions uniformly

At=arg maxaQt(a)A_t = \argmax_a Q_t(a)

Incremental Averaging

sample averaging - move in the direction of the sampled reward target

Qn+1=1ni=1nRi,=Qn+1n[RnQn],=Qn+α[RnQn].\begin{align*} Q_{n+1} &= \frac{1}{n}\sum_{i=1}^n R_i,\\ &= Q_n + \frac{1}{n}\left[R_n - Q_n \right],\\ &= Q_n + \alpha\left[R_n - Q_n \right]. \end{align*}
NewEstimateOldEstimate+StepSize(TargetOldEstimate)\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}(\text{Target} - \text{OldEstimate})
-error=TargetOldEstimate\text{error} = \text{Target} - \text{OldEstimate}
-step-size is often denoted by α\alpha

Non-Stationary Bandit Problems

-reward probabilities and values change over time
-more recent rewards should often receive more weight
-fixing α(0,1]\alpha \in (0, 1] gives the weighted average
Qn+1=(1α)nQ1+i=1nα(1α)niRiQ_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^n \alpha(1 - \alpha)^{n-i} R_i
-RiR_i is the reward after the iith selection of a given action
-this is an exponential recency-weighted average

αn(a)\alpha_n(a) - step-size parameter used after the nnth selection of action aa

Conditions on αn(a)\alpha_n(a):

1.i=1nαn(a)= and 2.i=1nαn2(a)<1. \sum_{i=1}^n \alpha_n(a) = \infty \quad \text{ and } \quad 2. \sum_{i=1}^n \alpha_n^2(a) < \infty
1.steps must remain large enough to overcome initial conditions and fluctuations
2.steps must eventually become small enough to assure convergence

non-converging step sizes can be desirable in non-stationary environments

Optimal Initial Values

-let the initial estimate be optimistic, e.g. Q0=[5]kQ_0 = [5] * k with ϵ=0\epsilon = 0
-since rewards are sampled from distributions centered near 0, updates quickly reduce these optimistic estimates
-the agent keeps switching actions as old optimistic estimates are revised downward
-this induces a kind of greedy exploration
-effective for stationary problems
-not ideal for non-stationary problems because exploration fades over time

Upper-Confidence-Bound Action Selection (UCB)

-a better exploration strategy may choose non-greedy actions based on their potential to be optimal
At=arg maxa[Qt(a)+clntNt(a)]A_t = \argmax_a \left[Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}\right]
-cc controls confidence width
-the more often an action is selected, the larger Nt(a)N_t(a) gets and the smaller the uncertainty term becomes
-if an action is neglected for many steps, lnt\ln t grows and its exploration bonus rises

Gradient Bandit Algorithm

-Ht(a)H_t(a) - numerical preference for each action
-does not directly represent reward
-πt(a)\pi_t(a) - probability of taking action aa at time tt, using a softmax over preferences
πt(a)=eHt(a)b=1keHt(b)\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^k e^{H_t(b)}}

Preference updates

Ht+1(At)=Ht(At)+α(RtRtˉ)(1πt(At))H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar{R_t})(1-\pi_t(A_t))
Ht+1(a)=Ht(a)α(RtRtˉ)πt(a),aAtH_{t+1}(a) = H_t(a) - \alpha(R_t - \bar{R_t})\pi_t(a), \quad \forall a \neq A_t
-Rtˉ\bar{R_t} - average reward up to time tt
-if the realized reward is above average, preference for the chosen action increases

Associative Search (Contextual Bandits)

non-associative tasks - no need to match different actions to different situations

-stationary setting: find the single best action
-non-stationary setting: track the best action over time

associative tasks - require searching for the best actions and associating them with the situations in which they are best