Chapter 1: Introduction

June 8, 2026

reinforcement learning - learning how to map situations to actions so as to maximize a numerical reward signal

-important aspects:

-trial-and-error search - discover which actions yield the highest reward by trying them

-delayed reward - actions may affect not only the immediate reward but also the next situation and later rewards

reinforcement learning method - any method well suited to solving problems framed as Markov decision processes, which capture sensation, action, and goal

supervised learning - use a labeled training set to extrapolate behavior to situations not present in the training data

-not adequate for learning from interaction, since it is impractical to obtain examples of desired behavior across all relevant scenarios

-an agent must learn from its own experience in uncharted territory

unsupervised learning - trying to uncover hidden structure

-reinforcement learning is sometimes grouped with it because neither relies on labeled examples, but RL is distinct: it tries to maximize a reward signal rather than to find hidden structure

explore vs. exploit

-to obtain reward, an RL agent must prefer actions it has tried in the past and found effective in producing reward

-to discover new rewarding actions, the agent also has to try actions it has not selected before

-neither can be done exclusively if the agent is to succeed
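The trade-off above can be sketched with an epsilon-greedy agent on a multi-armed bandit. This is only one common way to balance the two, chosen here for illustration; the function name `epsilon_greedy_bandit`, the reward means, and the `noise` parameter are all assumptions, not something from the chapter notes.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, noise=1.0, seed=0):
    """Run an epsilon-greedy agent on a toy Gaussian bandit.

    Returns (estimates, counts): the running average reward per action
    and how often each action was selected.
    """
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n   # agent's current estimate of each action's reward
    counts = [0] * n        # number of times each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n)                           # explore: random action
        else:
            action = max(range(n), key=lambda a: estimates[a])  # exploit: best so far
        reward = rng.gauss(true_means[action], noise)
        counts[action] += 1
        # incremental sample-average update of the estimate
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical three-action problem; the third action is truly the best.
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 1.0])
```

Setting `epsilon=0.0` with `noise=0.0` shows why exploitation alone fails: the agent tries the first action, sees a positive reward, and never tries the others. Setting `epsilon=1.0` shows the opposite failure: the agent learns good estimates but never uses them.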

“weak methods” - methods based on general principles, such as search or learning

“strong methods” - methods based on specific knowledge and facts

-in the late 1960s, many AI researchers believed that a large enough collection of special-purpose tricks, procedures, and heuristics, rather than general principles, would make programs intelligent

1.policy - defines the agent’s way of behaving at a given time

-mapping from perceived states to actions taken from those states

-may be stochastic, specifying probabilities of each behavior

2.reward signal - quantifies the goal of the problem

-the agent's sole objective is to maximize the total reward it receives over the long run

-indicates what is good in an immediate sense

3.value function - quantifies what is good in the long run

-the total amount of reward an agent can expect to accumulate in the future, starting at the current state

-long-term value

4.environment model - mimics the behavior of the environment to enable inferences about how it will behave

-model-based methods use models and planning

-model-free methods are explicitly trial-and-error learners
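A minimal sketch of how the four elements fit together, assuming a tiny deterministic, episodic, undiscounted problem. The states, actions, rewards, and the names `MODEL`, `POLICY`, `value`, and `plan_greedy` are all hypothetical and chosen only for illustration.

```python
# 4. Environment model: (state, action) -> (next_state, reward)
MODEL = {
    ("start", "left"): ("dead_end", 1.0),    # small reward right away
    ("start", "right"): ("hallway", 0.0),    # nothing now...
    ("hallway", "forward"): ("goal", 10.0),  # ...but a large reward later
}
TERMINAL = {"dead_end", "goal"}

# 1. Policy: state -> {action: probability} (stochastic in "start")
POLICY = {
    "start": {"left": 0.5, "right": 0.5},
    "hallway": {"forward": 1.0},
}

def value(state):
    """3. Value function: expected total future reward from `state` under POLICY."""
    if state in TERMINAL:
        return 0.0
    total = 0.0
    for action, prob in POLICY[state].items():
        next_state, reward = MODEL[(state, action)]  # 2. reward signal (immediate)
        total += prob * (reward + value(next_state))
    return total

def plan_greedy(state):
    """Model-based planning: one-step lookahead using the model and the values."""
    def lookahead(action):
        next_state, reward = MODEL[(state, action)]
        return reward + value(next_state)
    return max(POLICY[state], key=lookahead)

print(value("start"))        # 5.5
print(plan_greedy("start"))  # right
```

Going "left" has the higher immediate reward (1.0 versus 0.0), but "right" leads to the state with the higher value, so the lookahead picks "right". A model-free learner would have no `MODEL` to consult and would have to discover this by trying both actions.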

evolutionary methods - apply multiple static policies, each interacting over an extended period of time with separate instances of an environment

-never estimate a value function

-policies that achieve the most reward, plus random variations of them, are carried over to the next generation

-similar to biological evolution

-examples include:

-genetic algorithms

-genetic programming

-simulated annealing
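The generational loop described above can be sketched as a simple elitist search over fixed policies. This is an illustrative toy, not a faithful genetic algorithm; the corridor environment, its reward (the agent's position after each move), and the names `episode_return`, `mutate`, and `evolve` are assumptions.

```python
import random

N_STATES, EPISODE_LEN = 5, 10

def episode_return(policy):
    """Total reward of one static policy in a toy corridor.

    policy[pos] is -1 (left) or +1 (right); reward each step is the new position.
    """
    pos, total = 0, 0
    for _ in range(EPISODE_LEN):
        pos = min(max(pos + policy[pos], 0), N_STATES - 1)
        total += pos
    return total

def mutate(policy, rng, rate=0.2):
    """Random variation: flip each state's action with probability `rate`."""
    return [-a if rng.random() < rate else a for a in policy]

def evolve(generations=30, pop_size=20, n_keep=5, seed=0):
    """Keep the highest-reward policies each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [[rng.choice([-1, 1]) for _ in range(N_STATES)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Policies are scored only by the reward they collect; no value function is estimated.
        ranked = sorted(population, key=episode_return, reverse=True)
        elite = ranked[:n_keep]
        children = [mutate(rng.choice(elite), rng) for _ in range(pop_size - n_keep)]
        population = elite + children
    return max(population, key=episode_return)

best = evolve()
```

Each policy stays fixed for its whole episode, and nothing learned within an episode (for example, which individual states were good) is used; only the final total matters.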

temporal-difference learning - changes are based on a difference between estimates at two successive times

V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right]

\alpha - step-size parameter that influences the rate of learning

-value estimates are updated after greedy moves to states already believed to be good, not only after exploratory moves to states believed to be suboptimal
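The update rule above translates almost directly into code. In this minimal sketch the value table is a plain dictionary; the state names, the initial values, and the function name `td_update` are illustrative assumptions.

```python
def td_update(V, state, next_state, alpha=0.1):
    """Move V[state] a fraction alpha toward V[next_state], in place.

    Implements V(S_t) <- V(S_t) + alpha * [V(S_{t+1}) - V(S_t)].
    """
    V[state] += alpha * (V[next_state] - V[state])
    return V[state]

# Hypothetical value table: unknown states start at 0.5, a winning state at 1.0.
V = {"s0": 0.5, "s1": 0.5, "win": 1.0}

td_update(V, "s1", "win")  # V["s1"] moves from 0.5 to about 0.55
td_update(V, "s0", "s1")   # V["s0"] moves from 0.5 to about 0.505
```

The second call shows how information propagates backward: "s0" improves only because "s1" was already nudged toward the winning state. A larger `alpha` makes each estimate chase its successor faster, at the cost of noisier values.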