Chapter 4: Dynamic Programming

June 8, 2026

-dynamic programming (DP) algorithms compute optimal policies given a perfect MDP model of the environment

-a perfect model is rarely available in real RL problems, and DP is computationally expensive, but DP provides the conceptual foundation

-DP algorithms come from turning Bellman equations into update rules for improving approximations to value functions

Bellman optimality equations

v_*(s)= \max_a \mathbb{E} [R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]

q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'}q_*(S_{t+1}, a') \mid S_t = s, A_t = a]

policy evaluation - computing the state-value function for an arbitrary policy $\pi$

\begin{align*} v_{\pi}(s) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] \\ &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_{\pi}(s')] \end{align*}

-if the environment dynamics are fully known, this is a system of $|\mathcal{S}|$ simultaneous linear equations in $|\mathcal{S}|$ unknowns (the values $v_{\pi}(s)$)
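As a minimal sketch of that system in matrix form, the snippet below solves $(I - \gamma P_{\pi}) v_{\pi} = r_{\pi}$ directly with NumPy. The array layout (`P[s, a, s2]`, `R[s, a]`, `pi[s, a]`), the function name, and the two-state MDP are assumptions made for illustration; rewards are collapsed to their expected value per state-action pair instead of being summed over $r$.

```python
import numpy as np

def solve_policy_value(P, R, pi, gamma):
    """Solve the Bellman expectation equations exactly as a linear system.

    Assumed (hypothetical) array layout:
      P[s, a, s2] : transition probability p(s2 | s, a)
      R[s, a]     : expected immediate reward for taking a in s
      pi[s, a]    : probability pi(a | s)
    """
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transitions under pi
    r_pi = np.sum(pi * R, axis=1)          # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)                  # uniform random policy

v = solve_policy_value(P, R, pi, gamma=0.9)
print(v)
```

A direct solve costs roughly cubic time in the number of states, which is one reason the iterative approach is usually preferred for larger problems.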

iterative policy evaluation - approximate $v_{\pi}(s)$ with repeated updates under a fixed policy

v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_k(s')]

-the sequence $v_k$ converges to $v_{\pi}$ as $k \rightarrow \infty$
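The update rule can be sketched in Python as repeated synchronous sweeps over all states. This is an illustrative sketch, not a reference implementation: the array layout (`P[s, a, s2]`, `R[s, a]`, `pi[s, a]`), the stopping tolerance `tol`, the `max_sweeps` cap, and the toy MDP are all assumptions, and rewards are represented by their expected value per state-action pair.

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, tol=1e-10, max_sweeps=10_000):
    """Approximate v_pi by applying the Bellman expectation backup repeatedly.

    Assumed (hypothetical) array layout:
      P[s, a, s2] : transition probability p(s2 | s, a)
      R[s, a]     : expected immediate reward for taking a in s
      pi[s, a]    : probability pi(a | s)
    """
    v = np.zeros(P.shape[0])               # v_0: arbitrary starting estimate
    for _ in range(max_sweeps):
        q = R + gamma * (P @ v)            # q[s, a]: one-step lookahead using v_k
        v_new = np.sum(pi * q, axis=1)     # average over actions under pi
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)                  # uniform random policy

v = iterative_policy_evaluation(P, R, pi, gamma=0.9)
print(v)
```

This version keeps two arrays (`v` and `v_new`), matching the $v_k$ to $v_{k+1}$ form of the equation; updating a single array in place is a common variant.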

-once you have a better estimate of $v_{\pi}(s)$, you can improve the policy itself

policy improvement theorem - let $\pi'$ and $\pi$ be deterministic policies such that

q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s), \quad \forall s \in \mathcal{S}

then

v_{\pi'}(s) \geq v_{\pi}(s), \quad \forall s \in \mathcal{S}

The greedy policy with respect to $v_{\pi}$ satisfies this condition by construction:

\begin{align*} \pi'(s) &= \argmax_a q_{\pi}(s, a) \\ &= \argmax_a \sum_{s', r} p(s', r \mid s,a) [r + \gamma v_{\pi}(s')] \end{align*}

-rationale: if taking a different first action and then following $\pi$ is at least as good as following $\pi$ from the start, then always taking that action defines a policy at least as good as $\pi$
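A small sketch of the greedy improvement step, assuming the value function of the current policy is already available. The array layout (`P[s, a, s2]`, `R[s, a]`), the function names, and the toy MDP are hypothetical; the vector `v` holds the values of the uniform random policy on this particular MDP.

```python
import numpy as np

def q_from_v(P, R, v, gamma):
    """One-step lookahead: q[s, a] = R[s, a] + gamma * sum_s2 P[s, a, s2] * v[s2]."""
    return R + gamma * (P @ v)

def greedy_policy(P, R, v, gamma):
    """Return, for each state, the action that maximizes the one-step lookahead."""
    return np.argmax(q_from_v(P, R, v, gamma), axis=1)

# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
v = np.array([7.25, 7.75])   # v_pi of the uniform random policy on this MDP

q = q_from_v(P, R, v, gamma=0.9)
new_policy = greedy_policy(P, R, v, gamma=0.9)

# Condition of the policy improvement theorem: q_pi(s, pi'(s)) >= v_pi(s).
print(new_policy, q[np.arange(2), new_policy] >= v)
```

Ties are broken by `np.argmax` in favor of the lowest action index, which is one arbitrary but deterministic choice.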

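policy iteration - repeatedly alternate between policy evaluation and policy improvement

-each cycle strictly improves the policy unless it is already optimal; a finite MDP has finitely many deterministic policies, so the process converges to an optimal policy in a finite number of cycles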
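The alternation can be sketched as a loop that stops once the greedy policy no longer changes. This sketch makes several assumptions: the array layout (`P[s, a, s2]`, `R[s, a]`) and the toy MDP are hypothetical, the evaluation step uses an exact linear solve in place of iterative sweeps, and `max_iters` is only a safety cap.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=1_000):
    """Alternate exact policy evaluation with greedy policy improvement.

    Assumed (hypothetical) array layout:
      P[s, a, s2] : transition probability p(s2 | s, a)
      R[s, a]     : expected immediate reward for taking a in s
    """
    n_states = R.shape[0]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)       # arbitrary initial deterministic policy
    v = np.zeros(n_states)
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi for the current policy.
        P_pi = P[states, policy]                 # shape (n_states, n_states)
        r_pi = R[states, policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        new_policy = np.argmax(R + gamma * (P @ v), axis=1)
        if np.array_equal(new_policy, policy):   # policy is stable, so stop
            break
        policy = new_policy
    return policy, v

# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

policy, v = policy_iteration(P, R, gamma=0.9)
print(policy, v)
```

On this toy problem the loop stabilizes after two evaluation steps, choosing action 1 in both states.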

value iteration - similar to policy iteration, except policy evaluation is truncated to a single sweep, which combines with policy improvement into one Bellman optimality backup per state

\begin{align*} v_{k+1}(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] \\ &= \max_a \sum_{s', r} p(s', r \mid s, a) [ r + \gamma v_k(s')] \end{align*}
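A minimal sketch of this backup applied as repeated sweeps, with a greedy policy read off at the end. The array layout (`P[s, a, s2]`, `R[s, a]`), the stopping tolerance `tol`, the `max_sweeps` cap, and the toy MDP are assumptions for illustration, and rewards are represented by their expected value per state-action pair.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_sweeps=10_000):
    """Apply the Bellman optimality backup until the values stop changing.

    Assumed (hypothetical) array layout:
      P[s, a, s2] : transition probability p(s2 | s, a)
      R[s, a]     : expected immediate reward for taking a in s
    """
    v = np.zeros(P.shape[0])
    for _ in range(max_sweeps):
        q = R + gamma * (P @ v)          # one-step lookahead for every (s, a)
        v_new = np.max(q, axis=1)        # max over actions replaces the policy average
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    # Extract a greedy policy from the final value estimate.
    policy = np.argmax(R + gamma * (P @ v), axis=1)
    return v, policy

# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

v, policy = value_iteration(P, R, gamma=0.9)
print(v, policy)
```

No explicit policy is stored during the sweeps; the maximum over actions plays the role of the improvement step.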

generalized policy iteration (GPI) - almost all RL methods let policy evaluation and policy improvement interact, whatever the granularity of each process

-evaluation: make the value function consistent with the current policy

-improvement: make the policy greedy with respect to the current value function