Chapter 3: Finite Markov Decision Processes (MDPs)

June 8, 2026

MDP - classical formalization of sequential decision making

-actions influence immediate rewards and subsequent states, and thus future rewards

-bandit problem: $q_*(a)$ is an action value

-MDP: $q_*(s, a)$ is an action value conditioned on state

-agent - the entity making decisions and learning

-environment - everything the agent interacts with that it cannot arbitrarily change

-at each time step $t$, the agent observes a state $S_t \in \mathcal{S}$ and selects an action $A_t \in \mathcal{A}(s)$

-as a consequence, it receives reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ and transitions to state $S_{t+1}$
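The interaction described above can be sketched as a simple loop. This is a minimal illustration, not a standard API: the corridor environment, `env_step`, `rollout`, and `always_right` are all hypothetical names invented for this example.

```python
def env_step(state, action):
    """Hypothetical corridor with cells 0..3; arriving in cell 3 pays reward 1."""
    next_state = min(max(state + action, 0), 3)
    reward = 1.0 if next_state == 3 else 0.0
    return reward, next_state


def rollout(policy, start_state, max_steps):
    """Record (S_t, A_t, R_{t+1}, S_{t+1}) tuples until cell 3 or max_steps."""
    trajectory = []
    state = start_state
    for _ in range(max_steps):
        if state == 3:  # nothing left to do once the goal cell is reached
            break
        action = policy(state)                         # agent selects A_t from S_t
        reward, next_state = env_step(state, action)   # environment returns R_{t+1}, S_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


def always_right(state):
    """Hypothetical policy: always move one cell to the right."""
    return +1


trajectory = rollout(always_right, start_state=0, max_steps=10)
```

Note the indexing: the reward produced by acting at time $t$ is recorded as $R_{t+1}$, alongside the next state $S_{t+1}$.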

$R_t$ and $S_t$ have discrete probability distributions determined by the preceding state and action

p(s', r \mid s, a) = \Pr(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a)
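One way to make the dynamics function concrete is a lookup table. The sketch below assumes a made-up two-state MDP; the state names, action names, and probabilities are illustrative only, and `p` and `is_valid` are hypothetical helpers.

```python
# Hypothetical two-state MDP; every name and number here is illustrative.
# dynamics[(s, a)] maps each (next_state, reward) outcome to p(s', r | s, a).
dynamics = {
    ("cool", "run"):  {("cool", 2.0): 0.5, ("hot", 2.0): 0.5},
    ("cool", "walk"): {("cool", 1.0): 1.0},
    ("hot", "run"):   {("hot", -10.0): 1.0},
    ("hot", "walk"):  {("cool", 1.0): 0.5, ("hot", 1.0): 0.5},
}


def p(s_next, r, s, a):
    """Return p(s', r | s, a); outcomes not listed have probability 0."""
    return dynamics[(s, a)].get((s_next, r), 0.0)


def is_valid(dyn, tol=1e-9):
    """Check that the outcome probabilities sum to 1 for every (s, a) pair."""
    return all(abs(sum(outcomes.values()) - 1.0) <= tol for outcomes in dyn.values())
```

Because $p$ is a probability distribution over $(s', r)$ for each $(s, a)$, the check in `is_valid` is the property any such table has to satisfy.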

Markov property - the current state contains all information needed to predict the future

state-transition probabilities

p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)

expected reward

-state-action pairs

r(s, a) = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)

-state-action-next-state triples

r(s, a, s') = \sum_{r \in \mathcal{R}} r \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}
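The three derived quantities above can all be computed from the joint table $p(s', r \mid s, a)$. A minimal sketch, assuming a hypothetical one-entry dynamics table; the function names are invented for this example.

```python
from collections import defaultdict

# Hypothetical dynamics for a single (state, action) pair.
# dynamics[(s, a)] maps (next_state, reward) to p(s', r | s, a).
dynamics = {
    ("s0", "a"): {("s0", 0.0): 0.25, ("s1", 1.0): 0.5, ("s1", 3.0): 0.25},
}


def transition_prob(dyn, s, a):
    """p(s' | s, a): sum the joint distribution over rewards."""
    out = defaultdict(float)
    for (s_next, _r), prob in dyn[(s, a)].items():
        out[s_next] += prob
    return dict(out)


def expected_reward(dyn, s, a):
    """r(s, a): sum of r * p(s', r | s, a) over all outcomes."""
    return sum(r * prob for (_s_next, r), prob in dyn[(s, a)].items())


def expected_reward_given_next(dyn, s, a, s_next):
    """r(s, a, s'): expected reward conditioned on landing in s'."""
    p_next = transition_prob(dyn, s, a)[s_next]
    joint = sum(r * prob for (sn, r), prob in dyn[(s, a)].items() if sn == s_next)
    return joint / p_next
```

For this table, `expected_reward(dynamics, "s0", "a")` is $0 \cdot 0.25 + 1 \cdot 0.5 + 3 \cdot 0.25 = 1.25$, and conditioning on $s' = $ `"s1"` gives $1.25 / 0.75 = 5/3$.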

reward hypothesis - all goals can be framed as maximizing the expected cumulative reward

-rewards should specify what the agent is to achieve, not how it should achieve it

Returns and Episodes

return $G_t$ - a function of the rewards received after time step $t$; the agent seeks to maximize the expected return

-in the simplest case, the return is the sum of those rewards:

G_t = R_{t+1} + R_{t+2} + \cdots + R_T

episode - a sequence of actions and events that starts in an initial state and ends in a terminal state

-each episode begins independently of how the previous one ended

-episodic tasks are tasks in which the interaction breaks naturally into episodes

$\mathcal{S}$ - nonterminal states, $\mathcal{S}^+ = \mathcal{S} \cup \{\text{terminal}\}$

$T$ - time of termination

continuing tasks - non-episodic tasks, like ongoing control processes

-if $T = \infty$, the naive undiscounted return may be infinite

discounting - present value of future rewards

$0 \leq \gamma < 1$ is the discount rate

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{i=0}^{\infty} \gamma^i R_{t+i+1}

-a reward received $k$ steps in the future is worth $\gamma^{k-1}$ times its immediate value

$\gamma = 0$ makes the agent myopic
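A short sketch of computing the discounted return from a finite list of rewards. It relies on the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, which follows from factoring $\gamma$ out of the sum above; the function name `discounted_return` is an assumption made for this example.

```python
def discounted_return(rewards, gamma):
    """G_t for rewards = [R_{t+1}, R_{t+2}, ...].

    Works backward through the list using G_t = R_{t+1} + gamma * G_{t+1},
    with the return after the last reward taken to be 0.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


g_half = discounted_return([1.0, 1.0, 1.0], gamma=0.5)    # 1 + 0.5 + 0.25
g_myopic = discounted_return([1.0, 5.0, 9.0], gamma=0.0)  # only R_{t+1} counts
```

With $\gamma = 0.5$ the three unit rewards are weighted $1$, $0.5$, and $0.25$, so the third reward ($k = 3$) carries weight $\gamma^{k-1} = 0.25$. With $\gamma = 0$ only the first reward survives.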

Unified Notation for Episodic and Continuing Tasks

-treat termination as entering an absorbing state that transitions only to itself and generates only rewards of 0
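With the absorbing-state convention, a single expression can cover both kinds of task. A sketch of the unified return, assuming that either $T = \infty$ or $\gamma = 1$ is allowed, but not both at once:

```latex
G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k
```

For example, an episode with $R_1 = 1$ and $R_2 = 1$ that then enters the absorbing state has $G_0 = 1 + \gamma + 0 + 0 + \cdots$, which equals the finite sum $1 + \gamma$ obtained by stopping at $T = 2$.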

Policies and Value Functions

policy ($\pi$) - a mapping from each state to a probability distribution over actions; $\pi(a \mid s)$ is the probability that $A_t = a$ given $S_t = s$

state-value function $v_{\pi}(s)$ - expected return of starting in state $s$ and following $\pi$

v_{\pi}(s) = \mathbb{E}_{\pi} [G_t \mid S_t = s] = \mathbb{E}_{\pi} \left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s \right]

action-value function $q_{\pi}(s, a)$ - expected return of taking action $a$ in state $s$ and then following $\pi$

q_{\pi}(s, a)= \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]

Bellman equation - relates the value of a state to the value of its successor states

v_{\pi}(s) = \sum_{a} \pi(a \mid s)\sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_{\pi}(s')\right]
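The Bellman equation can be used as an update rule: repeatedly replace each $v(s)$ with the right-hand side until the values stop changing. This is iterative policy evaluation, a technique these notes do not cover, shown here only to illustrate the equation. The two-state MDP, the uniform random policy, and all names are hypothetical, and the sketch assumes every next state also appears as a key in `dynamics`.

```python
# Hypothetical MDP: dynamics[(s, a)] maps (next_state, reward) to p(s', r | s, a).
dynamics = {
    ("s0", "stay"): {("s0", 1.0): 1.0},
    ("s0", "go"):   {("s1", 0.0): 1.0},
    ("s1", "stay"): {("s1", 2.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 1.0},
}
# policy[s][a] is pi(a | s); here a uniform random policy.
policy = {
    "s0": {"stay": 0.5, "go": 0.5},
    "s1": {"stay": 0.5, "go": 0.5},
}


def policy_evaluation(dynamics, policy, gamma, tol=1e-10):
    """Sweep the Bellman equation for v_pi as an in-place update until convergence."""
    states = sorted({s for s, _a in dynamics})
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]
            new_v = sum(
                pi_a * prob * (r + gamma * v[s_next])
                for a, pi_a in policy[s].items()
                for (s_next, r), prob in dynamics[(s, a)].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


v_pi = policy_evaluation(dynamics, policy, gamma=0.5)
```

For this example the equations can also be solved by hand: $v_{\pi}(s_0) = 0.5 + 0.25\,v_{\pi}(s_0) + 0.25\,v_{\pi}(s_1)$ and $v_{\pi}(s_1) = 1 + 0.25\,v_{\pi}(s_0) + 0.25\,v_{\pi}(s_1)$, giving $v_{\pi}(s_0) = 1.25$ and $v_{\pi}(s_1) = 1.75$, which the iteration approaches.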

Optimal Policies and Optimal Value Functions

optimal policy ($\pi_*$) - a policy whose expected return is at least that of every other policy in every state

optimal state-value function

v_*(s) = \max_{\pi} v_{\pi}(s) = \max_a \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a]

Using the Bellman optimality form:

v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]

optimal action-value function

q_*(s, a) = \max_{\pi} q_{\pi}(s, a)

q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a]
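The Bellman optimality equation for $v_*$ can likewise be turned into an update rule. The sketch below uses value iteration, which is not described in these notes and is named here only as one standard way to apply the equation. The MDP and all function names are hypothetical, and every next state is assumed to appear as a key in `dynamics`.

```python
# Hypothetical MDP: dynamics[(s, a)] maps (next_state, reward) to p(s', r | s, a).
dynamics = {
    ("s0", "stay"): {("s0", 1.0): 1.0},
    ("s0", "go"):   {("s1", 0.0): 1.0},
    ("s1", "stay"): {("s1", 3.0): 1.0},
    ("s1", "go"):   {("s0", 0.0): 1.0},
}


def q_from_v(dynamics, v, s, a, gamma):
    """E[R_{t+1} + gamma * v(S_{t+1}) | S_t = s, A_t = a]."""
    return sum(prob * (r + gamma * v[s_next])
               for (s_next, r), prob in dynamics[(s, a)].items())


def value_iteration(dynamics, gamma, tol=1e-10):
    """Apply v(s) <- max_a q(s, a) until convergence, then read off a greedy policy."""
    states = sorted({s for s, _a in dynamics})
    actions = {s: [a for (s2, a) in dynamics if s2 == s] for s in states}
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_from_v(dynamics, v, s, a, gamma) for a in actions[s])
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    greedy = {
        s: max(actions[s], key=lambda a: q_from_v(dynamics, v, s, a, gamma))
        for s in states
    }
    return v, greedy


v_star, pi_star = value_iteration(dynamics, gamma=0.5)
```

By hand, staying in `s1` forever is worth $3 / (1 - 0.5) = 6$, so from `s0` moving to `s1` is worth $0.5 \cdot 6 = 3$, which beats staying ($1 + 0.5 \cdot 3 = 2.5$). The iteration should therefore approach $v_*(s_0) = 3$ and $v_*(s_1) = 6$, with a greedy policy of `go` in `s0` and `stay` in `s1`.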