Chapter 1: Introduction

June 8, 2026

reinforcement learning - how to map situations to actions to maximize a reward signal

-important aspects:
-trial-and-error search - discover which actions yield the highest reward by trying them
-delayed reward - actions may affect not only the immediate reward but also the next situation and later rewards

reinforcement learning method - any method well suited to solving problems modeled as Markov decision processes, which capture sensation, action, and goal

supervised learning - use a labeled training set to extrapolate behavior to situations not present in the training data

-not adequate for learning from interaction, since it is impractical to obtain examples of desired behavior across all relevant scenarios
-an agent must learn from its own experience in uncharted territory

unsupervised learning - finding hidden structure in collections of unlabeled data

-reinforcement learning is sometimes grouped with it because neither relies on labeled examples of correct behavior, but RL is distinct: it tries to maximize a reward signal rather than find hidden structure

explore vs. exploit

-to obtain reward, RL agents must prefer actions they have tried in the past and found effective at producing reward
-to discover new rewarding actions, the agent also has to try actions it has not selected before
-neither can be done exclusively if the agent is to succeed
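
One common way to balance the two is an epsilon-greedy rule: with a small probability the agent tries a random action, otherwise it takes the action that currently looks best. The sketch below is only an illustration under assumptions not in these notes: the three-action problem with noisy rewards, and the names `epsilon_greedy` and `run_bandit`, are hypothetical.

```python
import random

def epsilon_greedy(estimates, epsilon, rng):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: any action, uniformly
    best = max(estimates)
    # exploit: break ties among the best-looking actions at random
    return rng.choice([a for a, q in enumerate(estimates) if q == best])

def run_bandit(true_means, epsilon, steps, seed=0):
    """Act for `steps` rounds on a hypothetical noisy-reward problem."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # current guess of each action's reward
    counts = [0] * len(true_means)       # how often each action was tried
    total = 0.0
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)  # assumed reward noise
        counts[action] += 1
        # running average of the rewards seen for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return estimates, counts, total

estimates, counts, total = run_bandit([0.2, 0.8, 0.5], epsilon=0.1, steps=1000)
```

Setting `epsilon=0.0` gives a purely exploiting agent and `epsilon=1.0` a purely exploring one, which makes it easy to compare both extremes against a mix.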

“weak methods” - methods based on general principles and heuristics like search or learning

“strong methods” - methods based on specific knowledge and facts

-1960s researchers believed enough special-purpose tricks and heuristics would make programs intelligent

4 Main Subelements of an RL System

1.policy - defines the agent’s way of behaving at a given time
-mapping from perceived states to actions taken from those states
-may be stochastic, specifying probabilities of each behavior
2.reward signal - quantifies the goal of the problem
-the agent's sole objective is to maximize the total reward it receives over the long run
-indicates what is good in an immediate sense
3.value function - quantifies what is good in the long run
-the total amount of reward an agent can expect to accumulate in the future, starting at the current state
-long-term value
4.environment model - mimics the behavior of the environment to enable inferences about how it will behave
-model-based methods use models and planning
-model-free methods are explicitly trial-and-error learners
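
To make the four pieces concrete, here is a minimal sketch on a made-up problem. Everything in it is an assumption for illustration: a 5-state corridor where states 0 and 4 end the episode, only state 4 pays a reward of 1, and the agent steps left or right with equal probability. The value function is computed by repeatedly backing values up through the model, which is one simple form of planning.

```python
# Hypothetical 5-state corridor: states 0 and 4 are terminal, 4 is the goal.

def policy(state):
    """Policy: probability of each action (step left or right) in a state."""
    return {-1: 0.5, +1: 0.5}

def model(state, action):
    """Environment model: predicts the next state (deterministic here)."""
    return state + action

def reward(next_state):
    """Reward signal: immediate payoff, 1 only on reaching the goal."""
    return 1.0 if next_state == 4 else 0.0

def evaluate(policy, sweeps=200):
    """Value function: expected total future reward from each state,
    computed by planning with the model rather than by acting."""
    values = {s: 0.0 for s in range(5)}  # terminal states keep value 0
    for _ in range(sweeps):
        for s in range(1, 4):  # non-terminal states only
            values[s] = sum(
                p * (reward(model(s, a)) + values[model(s, a)])
                for a, p in policy(s).items()
            )
    return values

values = evaluate(policy)  # roughly 0.25, 0.5, 0.75 for states 1, 2, 3
```

The reward is 0 in every non-goal state, yet the values differ: states closer to the goal are worth more in the long run, which is the distinction between immediate and long-term value.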

evolutionary methods - apply multiple static policies, each interacting over an extended period of time with a separate instance of the environment

-never estimate a value function
-policies that achieve the most reward, plus random variations of them, are carried over to the next generation
-similar to biological evolution
-examples include:
-genetic algorithms
-genetic programming
-simulated annealing
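
The loop described above can be sketched roughly as follows. This is a toy illustration, not any particular published algorithm: the corridor environment, its reward (the position reached at each step), and the names `episode_return`, `mutate`, and `evolve` are all assumptions. Note that no value function appears anywhere; each policy is judged only by the total reward of a whole episode.

```python
import random

N_STATES, EPISODE_STEPS = 5, 8  # hypothetical corridor, fixed episode length

def episode_return(policy):
    """Total reward from running one fixed policy for a whole episode."""
    state, total = 0, 0
    for _ in range(EPISODE_STEPS):
        # policy[state] is -1 (left) or +1 (right); walls clamp the move
        state = min(max(state + policy[state], 0), N_STATES - 1)
        total += state  # hypothetical reward: position reached this step
    return total

def mutate(policy, rng):
    """Random variation: flip the action taken in one randomly chosen state."""
    i = rng.randrange(len(policy))
    return policy[:i] + (-policy[i],) + policy[i + 1:]

def evolve(generations=30, population_size=8, keep=2, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.choice((-1, 1)) for _ in range(N_STATES))
                  for _ in range(population_size)]
    history = []  # best episode return seen in each generation
    for _ in range(generations):
        ranked = sorted(population, key=episode_return, reverse=True)
        history.append(episode_return(ranked[0]))
        survivors = ranked[:keep]
        # Next generation: the best policies plus mutated copies of them.
        population = survivors + [mutate(rng.choice(survivors), rng)
                                  for _ in range(population_size - keep)]
    return max(population, key=episode_return), history

best_policy, history = evolve()
```

Because the survivors are carried over unchanged, the best return in `history` can never decrease from one generation to the next.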

temporal-difference learning - changes are based on a difference between estimates at two successive times

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right]$$
-$\alpha$ - step-size parameter that influences the rate of learning
-learning also happens while the agent exploits greedily: V is updated after greedy moves, not only after exploratory moves to states currently estimated to be suboptimal
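
As a small sketch of the update rule, the function below moves the estimate for the earlier state a fraction of the way toward the estimate for the later state. The state names "A" and "B", the starting values, and the function name `td_update` are made up for illustration.

```python
def td_update(values, state, next_state, alpha=0.1):
    """Apply V(S_t) <- V(S_t) + alpha * [V(S_{t+1}) - V(S_t)] in place."""
    values[state] += alpha * (values[next_state] - values[state])
    return values[state]

# Hypothetical estimates for two successive states.
values = {"A": 0.5, "B": 1.0}
td_update(values, "A", "B", alpha=0.1)  # V(A) moves from 0.5 to about 0.55
```

With alpha = 0.1 the estimate closes one tenth of the gap per update; a larger step size learns faster but reacts more strongly to each single transition.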