MDP - classical formalization of sequential decision making
-actions influence immediate rewards and subsequent states, and thus future rewards
-bandit problem: q∗(a) is an action value -MDP: q∗(s,a) is an action value conditioned on state -agent - the entity making decisions and learning
-environment - everything the agent interacts with that it cannot arbitrarily change
-at each time step t, the agent observes a state St∈S and selects an action At∈A(s) -as a consequence, it receives reward Rt+1∈R⊂R and transitions to state St+1 -Rt and St have discrete probability distributions determined by the preceding state and action p(s′,r∣s,a)=Pr(St=s′,Rt=r∣St−1=s,At−1=a) Markov property - the current state contains all information needed to predict the future
state-transition probabilities
p(s′∣s,a)=r∈R∑p(s′,r∣s,a) expected reward
r(s,a)=r∈R∑rs′∈S∑p(s′,r∣s,a) -state-action-next-state triples
r(s,a,s′)=r∈R∑rp(s′∣s,a)p(s′,r∣s,a) reward hypothesis - all goals can be framed as maximizing the expected cumulative reward
-rewards only encode the desired end state
Returns and Episodes
expected return Gt - the return of actions after time step t
-return is a function of the reward sequence, e.g. Rt+1+Rt+2+⋯+RT episode - a sequence of actions and events that starts in an initial state and ends in a terminal state
-each episode is independent
-episodic tasks are tasks with episodes
-S - nonterminal states, S+=S∪{terminal}, T - time of termination continuing tasks - non-episodic tasks, like ongoing control processes
-if T=∞, naive undiscounted return may be infinite discounting - present value of future rewards
-0≤γ<1 is the discount rate Gt=Rt+1+γRt+2+γ2Rt+3+⋯=i=0∑∞γiRt+i+1 -a reward received k steps in the future is worth γk−1 times its immediate value -γ=0 makes the agent myopic Unified Notation for Episodic and Continuing Tasks
-denote the terminal state as an absorbing state that transitions only to itself with reward 0
Policies and Value Functions
policy (π) - mapping from states to probabilities over actions
state-value function vπ(s) - expected return of starting in state s and following π
vπ(s)=Eπ[Gt∣St=s]=Eπ[k=0∑∞γkRt+k+1∣St=s] action-value function qπ(s,a) - expected return of taking action a in state s and then following π
qπ(s,a)=Eπ[Gt∣St=s,At=a] Bellman equation - relates the value of a state to the value of its successor states
vπ(s)=a∑π(a∣s)s′,r∑p(s′,r∣s,a)[r+γvπ(s′)] Optimal Policies and Optimal Value Functions
optimal policy - a policy that maximizes expected return
optimal state-value function
v∗(s)=πmaxvπ(s)=amaxE[Gt∣St=s,At=a] Using the Bellman optimality form:
v∗(s)=amaxE[Rt+1+γv∗(St+1)∣St=s,At=a] optimal action-value function
q∗(s,a)=πmaxqπ(s,a) q∗(s,a)=E[Rt+1+γa′maxq∗(St+1,a′)∣St=s,At=a]