evaluative feedback - indicates the quality of the action taken
instructive feedback - indicates the correct action
k-armed bandit problem - repeated choice between k k k actions, each producing reward from a stationary probability distribution conditioned on the chosen action
- action *value* - expected reward given that an action is selected
- q ∗ ( a ) q_*(a) q ∗ ( a ) is the expected reward of action a a a - A t A_t A t - action selected at time step t t t , R t R_t R t - resulting rewardq ∗ ( a ) = E [ R t ∣ A t = a ] q_*(a) = \mathbb{E}\left[R_t \mid A_t = a \right] q ∗ ( a ) = E [ R t ∣ A t = a ] - Q t ( a ) Q_t(a) Q t ( a ) - estimated value of action a a a at time step t t t - based on these estimates, one action is highest at any point in time
- exploit current knowledge by taking the greedy action
- explore to improve estimates of non-greedy actions
- the need to balance exploration and exploitation is a distinctive challenge in reinforcement learning
- action-value methods estimate action values and make action selection decisions
- averaging received rewards is one way to estimate them
Q t ( a ) = ∑ i = 1 t − 1 R i ⋅ 1 A i = a ∑ i = 1 t − 1 1 A i = a Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}} Q t ( a ) = ∑ i = 1 t − 1 1 A i = a ∑ i = 1 t − 1 R i ⋅ 1 A i = a 💡
Strong Law of Large Numbers
1 n ∑ j = 1 n R j ( a ) ⇒ E [ R ∣ A = a ] \frac{1}{n} \sum_{j=1}^{n} R_j^{(a)} \Rightarrow \mathbb{E}[R \mid A = a] n 1 j = 1 ∑ n R j ( a ) ⇒ E [ R ∣ A = a ] TLDR: with enough samples, the average reward for action a a a approaches its expected reward.
- over enough time steps, Q t ( a ) Q_t(a) Q t ( a ) approaches q ∗ ( a ) q_*(a) q ∗ ( a )
ϵ \epsilon ϵ -greedy methods - exploit the highest known value with probability 1 − ϵ 1 - \epsilon 1 − ϵ , otherwise explore all actions uniformly
A t = arg max a Q t ( a ) A_t = \argmax_a Q_t(a) A t = a arg max Q t ( a )
Incremental Averaging sample averaging - move in the direction of the sampled reward target
Q n + 1 = 1 n ∑ i = 1 n R i , = Q n + 1 n [ R n − Q n ] , = Q n + α [ R n − Q n ] . \begin{align*}
Q_{n+1} &= \frac{1}{n}\sum_{i=1}^n R_i,\\
&= Q_n + \frac{1}{n}\left[R_n - Q_n \right],\\
&= Q_n + \alpha\left[R_n - Q_n \right].
\end{align*} Q n + 1 = n 1 i = 1 ∑ n R i , = Q n + n 1 [ R n − Q n ] , = Q n + α [ R n − Q n ] . NewEstimate ← OldEstimate + StepSize ( Target − OldEstimate ) \text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}(\text{Target} - \text{OldEstimate}) NewEstimate ← OldEstimate + StepSize ( Target − OldEstimate ) - error = Target − OldEstimate \text{error} = \text{Target} - \text{OldEstimate} error = Target − OldEstimate - step-size is often denoted by α \alpha α
Non-Stationary Bandit Problems - reward probabilities and values change over time
- more recent rewards should often receive more weight
- fixing α ∈ ( 0 , 1 ] \alpha \in (0, 1] α ∈ ( 0 , 1 ] gives the weighted average Q n + 1 = ( 1 − α ) n Q 1 + ∑ i = 1 n α ( 1 − α ) n − i R i Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^n \alpha(1 - \alpha)^{n-i} R_i Q n + 1 = ( 1 − α ) n Q 1 + i = 1 ∑ n α ( 1 − α ) n − i R i - R i R_i R i is the reward after the i i i th selection of a given action- this is an exponential recency-weighted average
α n ( a ) \alpha_n(a) α n ( a ) - step-size parameter used after the n n n th selection of action a a a
Conditions on α n ( a ) \alpha_n(a) α n ( a ) :
1. ∑ i = 1 n α n ( a ) = ∞ and 2. ∑ i = 1 n α n 2 ( a ) < ∞ 1. \sum_{i=1}^n \alpha_n(a) = \infty \quad \text{ and } \quad 2. \sum_{i=1}^n \alpha_n^2(a) < \infty 1. i = 1 ∑ n α n ( a ) = ∞ and 2. i = 1 ∑ n α n 2 ( a ) < ∞ 1. steps must remain large enough to overcome initial conditions and fluctuations
2. steps must eventually become small enough to assure convergence
non-converging step sizes can be desirable in non-stationary environments
Optimal Initial Values - let the initial estimate be optimistic, e.g. Q 0 = [ 5 ] ∗ k Q_0 = [5] * k Q 0 = [ 5 ] ∗ k with ϵ = 0 \epsilon = 0 ϵ = 0 - since rewards are sampled from distributions centered near 0, updates quickly reduce these optimistic estimates
- the agent keeps switching actions as old optimistic estimates are revised downward
- this induces a kind of greedy exploration
- effective for stationary problems
- not ideal for non-stationary problems because exploration fades over time
Upper-Confidence-Bound Action Selection (UCB) - a better exploration strategy may choose non-greedy actions based on their potential to be optimal
A t = arg max a [ Q t ( a ) + c ln t N t ( a ) ] A_t = \argmax_a \left[Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}\right] A t = a arg max [ Q t ( a ) + c N t ( a ) ln t ] - c c c controls confidence width- the more often an action is selected, the larger N t ( a ) N_t(a) N t ( a ) gets and the smaller the uncertainty term becomes - if an action is neglected for many steps, ln t \ln t ln t grows and its exploration bonus rises
Gradient Bandit Algorithm - H t ( a ) H_t(a) H t ( a ) - numerical preference for each action- does not directly represent reward
- π t ( a ) \pi_t(a) π t ( a ) - probability of taking action a a a at time t t t , using a softmax over preferencesπ t ( a ) = e H t ( a ) ∑ b = 1 k e H t ( b ) \pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^k e^{H_t(b)}} π t ( a ) = ∑ b = 1 k e H t ( b ) e H t ( a )
Preference updates
H t + 1 ( A t ) = H t ( A t ) + α ( R t − R t ˉ ) ( 1 − π t ( A t ) ) H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar{R_t})(1-\pi_t(A_t)) H t + 1 ( A t ) = H t ( A t ) + α ( R t − R t ˉ ) ( 1 − π t ( A t )) H t + 1 ( a ) = H t ( a ) − α ( R t − R t ˉ ) π t ( a ) , ∀ a ≠ A t H_{t+1}(a) = H_t(a) - \alpha(R_t - \bar{R_t})\pi_t(a), \quad \forall a \neq A_t H t + 1 ( a ) = H t ( a ) − α ( R t − R t ˉ ) π t ( a ) , ∀ a = A t - R t ˉ \bar{R_t} R t ˉ - average reward up to time t t t - if the realized reward is above average, preference for the chosen action increases
Associative Search (Contextual Bandits) non-associative tasks - no need to match different actions to different situations
- stationary setting: find the single best action
- non-stationary setting: track the best action over time
associative tasks - require searching for the best actions and associating them with the situations in which they are best