Chapter 2: Multi-armed Bandits

June 8, 2026

evaluative feedback - indicates the quality of the action taken

instructive feedback - indicates the correct action

k-armed bandit problem - repeated choice between $k$ actions, each producing reward from a stationary probability distribution conditioned on the chosen action

-action *value* - expected reward given that an action is selected

q_*(a)

is the expected reward of action

a

A_t

- action selected at time step

t

R_t

- resulting reward

q_*(a) = \mathbb{E}\left[R_t \mid A_t = a \right]

Q_t(a)

- estimated value of action

a

at time step

t

-based on these estimates, one action is highest at any point in time

-exploit current knowledge by taking the greedy action

-explore to improve estimates of non-greedy actions

-the need to balance exploration and exploitation is a distinctive challenge in reinforcement learning

-action-value methods estimate action values and make action selection decisions

-averaging received rewards is one way to estimate them

Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}

-over enough time steps,

Q_t(a)

approaches

q_*(a)

$\epsilon$ -greedy methods - exploit the highest known value with probability $1 - \epsilon$ , otherwise explore all actions uniformly

A_t = \argmax_a Q_t(a)

Incremental Averaging

sample averaging - move in the direction of the sampled reward target

\begin{align*} Q_{n+1} &= \frac{1}{n}\sum_{i=1}^n R_i,\\ &= Q_n + \frac{1}{n}\left[R_n - Q_n \right],\\ &= Q_n + \alpha\left[R_n - Q_n \right]. \end{align*}

\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}(\text{Target} - \text{OldEstimate})

\text{error} = \text{Target} - \text{OldEstimate}

-step-size is often denoted by

\alpha

Non-Stationary Bandit Problems

-reward probabilities and values change over time

-more recent rewards should often receive more weight

-fixing

\alpha \in (0, 1]

gives the weighted average

Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^n \alpha(1 - \alpha)^{n-i} R_i

R_i

is the reward after the

i

th selection of a given action

-this is an exponential recency-weighted average

$\alpha_n(a)$ - step-size parameter used after the $n$ th selection of action $a$

Conditions on $\alpha_n(a)$ :

1. \sum_{i=1}^n \alpha_n(a) = \infty \quad \text{ and } \quad 2. \sum_{i=1}^n \alpha_n^2(a) < \infty

1.steps must remain large enough to overcome initial conditions and fluctuations

2.steps must eventually become small enough to assure convergence

non-converging step sizes can be desirable in non-stationary environments

Optimal Initial Values

-let the initial estimate be optimistic, e.g.

Q_0 = [5] * k

with

\epsilon = 0

-since rewards are sampled from distributions centered near 0, updates quickly reduce these optimistic estimates

-the agent keeps switching actions as old optimistic estimates are revised downward

-this induces a kind of greedy exploration

-effective for stationary problems

-not ideal for non-stationary problems because exploration fades over time

Upper-Confidence-Bound Action Selection (UCB)

-a better exploration strategy may choose non-greedy actions based on their potential to be optimal

A_t = \argmax_a \left[Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}\right]

c

controls confidence width

-the more often an action is selected, the larger

N_t(a)

gets and the smaller the uncertainty term becomes

-if an action is neglected for many steps,

\ln t

grows and its exploration bonus rises

Gradient Bandit Algorithm

H_t(a)

- numerical preference for each action

-does not directly represent reward

\pi_t(a)

- probability of taking action

a

at time

t

, using a softmax over preferences

\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^k e^{H_t(b)}}

Preference updates

H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar{R_t})(1-\pi_t(A_t))

H_{t+1}(a) = H_t(a) - \alpha(R_t - \bar{R_t})\pi_t(a), \quad \forall a \neq A_t

\bar{R_t}

- average reward up to time

t

-if the realized reward is above average, preference for the chosen action increases

Associative Search (Contextual Bandits)

non-associative tasks - no need to match different actions to different situations

-stationary setting: find the single best action

-non-stationary setting: track the best action over time

associative tasks - require searching for the best actions and associating them with the situations in which they are best