Chapter 3: Finite Markov Decision Processes (MDPs)

June 8, 2026

MDP - classical formalization of sequential decision making

-actions influence immediate rewards and subsequent states, and thus future rewards
-bandit problem: q(a)q_*(a) is an action value
-MDP: q(s,a)q_*(s, a) is an action value conditioned on state
-agent - the entity making decisions and learning
-environment - everything the agent interacts with that it cannot arbitrarily change
-at each time step tt, the agent observes a state StSS_t \in \mathcal{S} and selects an action AtA(s)A_t \in \mathcal{A}(s)
-as a consequence, it receives reward Rt+1RRR_{t+1} \in \mathcal{R} \subset \mathbb{R} and transitions to state St+1S_{t+1}
-RtR_t and StS_t have discrete probability distributions determined by the preceding state and action
p(s,rs,a)=Pr(St=s,Rt=rSt1=s,At1=a)p(s', r \mid s, a) = \Pr(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a)

Markov property - the current state contains all information needed to predict the future

state-transition probabilities

p(ss,a)=rRp(s,rs,a)p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)

expected reward

-state-action pairs
r(s,a)=rRrsSp(s,rs,a)r(s, a) = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)
-state-action-next-state triples
r(s,a,s)=rRrp(s,rs,a)p(ss,a)r(s, a, s') = \sum_{r \in \mathcal{R}} r \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}

reward hypothesis - all goals can be framed as maximizing the expected cumulative reward

-rewards only encode the desired end state

Returns and Episodes

expected return GtG_t - the return of actions after time step tt

-return is a function of the reward sequence, e.g. Rt+1+Rt+2++RTR_{t+1} + R_{t+2} + \cdots + R_T

episode - a sequence of actions and events that starts in an initial state and ends in a terminal state

-each episode is independent
-episodic tasks are tasks with episodes
-S\mathcal{S} - nonterminal states, S+=S{terminal}\mathcal{S}^+ = \mathcal{S} \cup \{\text{terminal}\}, TT - time of termination

continuing tasks - non-episodic tasks, like ongoing control processes

-if T=T = \infty, naive undiscounted return may be infinite

discounting - present value of future rewards

-0γ<10 \leq \gamma < 1 is the discount rate
Gt=Rt+1+γRt+2+γ2Rt+3+=i=0γiRt+i+1G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{i=0}^{\infty} \gamma^i R_{t+i+1}
-a reward received kk steps in the future is worth γk1\gamma^{k-1} times its immediate value
-γ=0\gamma = 0 makes the agent myopic

Unified Notation for Episodic and Continuing Tasks

-denote the terminal state as an absorbing state that transitions only to itself with reward 0

Policies and Value Functions

policy (π\pi) - mapping from states to probabilities over actions

state-value function vπ(s)v_{\pi}(s) - expected return of starting in state ss and following π\pi

vπ(s)=Eπ[GtSt=s]=Eπ[k=0γkRt+k+1St=s]v_{\pi}(s) = \mathbb{E}_{\pi} [G_t \mid S_t = s] = \mathbb{E}_{\pi} \left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s \right]

action-value function qπ(s,a)q_{\pi}(s, a) - expected return of taking action aa in state ss and then following π\pi

qπ(s,a)=Eπ[GtSt=s,At=a]q_{\pi}(s, a)= \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]

Bellman equation - relates the value of a state to the value of its successor states

vπ(s)=aπ(as)s,rp(s,rs,a)[r+γvπ(s)]v_{\pi}(s) = \sum_{a} \pi(a \mid s)\sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_{\pi}(s')\right]

Optimal Policies and Optimal Value Functions

optimal policy - a policy that maximizes expected return

optimal state-value function

v(s)=maxπvπ(s)=maxaE[GtSt=s,At=a]v_*(s) = \max_{\pi} v_{\pi}(s) = \max_a \mathbb{E}[G_t \mid S_t = s, A_t = a]

Using the Bellman optimality form:

v(s)=maxaE[Rt+1+γv(St+1)St=s,At=a]v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]

optimal action-value function

q(s,a)=maxπqπ(s,a)q_*(s, a) = \max_{\pi} q_{\pi}(s, a)
q(s,a)=E[Rt+1+γmaxaq(St+1,a)St=s,At=a]q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a]