Chapter 4: Dynamic Programming

June 8, 2026

-dynamic programming (DP) algorithms compute optimal policies given a perfect MDP model of the environment
-this is uncommon in real RL, but DP provides the conceptual foundation
-DP algorithms come from turning Bellman equations into update rules for improving approximations to value functions

Bellman optimality equations

v(s)=maxaE[Rt+1+γv(St+1)St=s,At=a]v_*(s)= \max_a \mathbb{E} [R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]
q(s,a)=E[Rt+1+γmaxaq(St+1,a)St=s,At=a]q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'}q_*(S_{t+1}, a') \mid S_t = s, A_t = a]

Policy Evaluation

policy evaluation - computing the state-value function for an arbitrary policy π\pi

vπ(s)=E[Rt+1+γvπ(s)St=s],=aπ(as)s,rp(s,rs,a)[r+γvπ(s)]\begin{align*} v_{\pi}(s) &= \mathbb{E}[R_{t+1} + \gamma v_{\pi}(s') \mid S_t = s],\\ &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_{\pi}(s')] \end{align*}
-if the environment dynamics are fully known, this is a system of S|\mathcal{S}| equations

iterative policy evaluation - approximate vπ(s)v_{\pi}(s) with repeated updates under a fixed policy

vk+1(s)=aπ(as)s,rp(s,rs,a)[r+γvk(s)],kv_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_k(s')], \quad k \rightarrow \infty

Policy Improvement

-once you have a better estimate of vπ(s)v_{\pi}(s), you can improve the policy itself

policy improvement theorem - let π\pi' and π\pi be deterministic policies such that

qπ(s,π(s))vπ(s),sSq_{\pi}(s, \pi'(s)) \geq v_{\pi}(s), \quad \forall s \in \mathcal{S}

then

vπ(s)vπ(s)v_{\pi'}(s) \geq v_{\pi}(s)

Thus the new greedy policy is

π(s)=arg maxaqπ(s,a)\pi'(s) = \argmax_a q_{\pi}(s, a)
π(s)=arg maxas,rp(s,rs,a)[r+γvπ(s)]\pi'(s) = \argmax_a \sum_{s', r} p(s', r \mid s,a) [r + \gamma v_{\pi}(s')]
-rationale: if taking a better first action and then following π\pi improves value, that defines a better policy

Policy Iteration

-repeatedly alternate between policy evaluation and policy improvement
-each cycle improves the policy until convergence

Value Iteration

-similar to policy iteration, except policy evaluation is truncated to a single Bellman-optimality backup at each step
vk+1(s)=maxaE[Rt+1+γvk(St+1)St=s,At=a],=maxas,rp(s,rs,a)[r+γvk(s)]\begin{align*} v_{k+1}(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a],\\ &= \max_a \sum_{s', r} p(s', r \mid s, a) [ r + \gamma v_k(s')] \end{align*}

Generalized Policy Iteration (GPI)

-almost all RL methods let policy evaluation and policy improvement interact
-always improve the value function toward the current policy
-always improve the policy with respect to the current value function