-dynamic programming (DP) algorithms compute optimal policies given a perfect MDP model of the environment
-this is uncommon in real RL, but DP provides the conceptual foundation
-DP algorithms come from turning Bellman equations into update rules for improving approximations to value functions
Bellman optimality equations
v∗(s)=amaxE[Rt+1+γv∗(St+1)∣St=s,At=a] q∗(s,a)=E[Rt+1+γa′maxq∗(St+1,a′)∣St=s,At=a] Policy Evaluation
policy evaluation - computing the state-value function for an arbitrary policy π
vπ(s)=E[Rt+1+γvπ(s′)∣St=s],=a∑π(a∣s)s′,r∑p(s′,r∣s,a)[r+γvπ(s′)] -if the environment dynamics are fully known, this is a system of ∣S∣ equations iterative policy evaluation - approximate vπ(s) with repeated updates under a fixed policy
vk+1(s)=a∑π(a∣s)s′,r∑p(s′,r∣s,a)[r+γvk(s′)],k→∞ Policy Improvement
-once you have a better estimate of vπ(s), you can improve the policy itself policy improvement theorem - let π′ and π be deterministic policies such that
qπ(s,π′(s))≥vπ(s),∀s∈S then
vπ′(s)≥vπ(s) Thus the new greedy policy is
π′(s)=aargmaxqπ(s,a) π′(s)=aargmaxs′,r∑p(s′,r∣s,a)[r+γvπ(s′)] -rationale: if taking a better first action and then following π improves value, that defines a better policy Policy Iteration
-repeatedly alternate between policy evaluation and policy improvement
-each cycle improves the policy until convergence
Value Iteration
-similar to policy iteration, except policy evaluation is truncated to a single Bellman-optimality backup at each step
vk+1(s)=amaxE[Rt+1+γvk(St+1)∣St=s,At=a],=amaxs′,r∑p(s′,r∣s,a)[r+γvk(s′)] Generalized Policy Iteration (GPI)
-almost all RL methods let policy evaluation and policy improvement interact
-always improve the value function toward the current policy
-always improve the policy with respect to the current value function