Proximal Policy Optimization Algorithms (Schulman et al.)
May 3, 2026
policy gradient methods - compute an estimator of the policy gradient and plug it into a stochastic gradient ascent (SGA) algorithm
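-common estimator: $\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$
- - "what update would increase the probability of the actions that turned out better than expected"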
-advantage - how much better a certain action is than the baseline
probability ratio - how likely the new policy is to take an action compared to the old policy: $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$
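A minimal sketch of the two definitions above in plain Python. It assumes a toy softmax policy over a few discrete actions whose logits do not depend on the state (a simplification for illustration, not something from the paper). All function names are made up for these notes.

```python
import math

def softmax(logits):
    """Turn raw logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_estimate(logits, actions, advantages):
    """Sample average of A_t * grad log pi(a_t), taken w.r.t. the logits.

    Assumption: a state-independent softmax policy, so the gradient of
    log pi(a) w.r.t. logit i is 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        for i, p in enumerate(probs):
            grad[i] += adv * ((1.0 if i == a else 0.0) - p)
    return [g / len(actions) for g in grad]

def probability_ratio(new_logp, old_logp):
    """r = pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(new_logp - old_logp)

# Hypothetical batch: action 0 did better than expected, action 1 did worse.
grad = policy_gradient_estimate([0.0, 0.0], actions=[0, 1], advantages=[1.0, -1.0])

# Hypothetical numbers: old policy gave the action p=0.25, new policy gives p=0.5.
ratio = probability_ratio(math.log(0.5), math.log(0.25))
```

In this toy batch the gradient is positive for the logit of action 0 and negative for action 1, so a gradient ascent step raises the probability of the action with positive advantage. The ratio comes out above 1 because the new policy is more likely to take the action than the old one.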
Motivation
-vanilla policy gradients are unstable and sensitive to step size
-TRPO fixes this with "trust regions" but is complex and expensive
-need something simple, stable, and sample-efficient (can reuse data)
Key Considerations
-don't move policy too far per step
-multiple gradient updates per batch
-stable advantage estimates (GAE)
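A rough sketch of how GAE advantages could be computed for one trajectory segment. It assumes no episode ends inside the segment, and the defaults for `gamma` and `lam` are common choices rather than values taken from these notes. `gae_advantages` and its argument names are hypothetical.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one segment.

    rewards[t], values[t]: reward and value prediction at step t.
    last_value: value prediction for the state after the final step.
    Assumption: no episode termination inside the segment.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the discounted sum after it.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# With gamma = lam = 1 and zero value predictions, this is just reward-to-go.
adv_full = gae_advantages([1.0, 1.0], [0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
# With lam = 0 only the one-step TD errors remain.
adv_td = gae_advantages([1.0, 1.0], [0.0, 0.0], 0.0, gamma=1.0, lam=0.0)
```

The `lam` parameter trades variance for bias: values near 1 lean on the observed rewards, values near 0 lean on the learned value function.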
Core Algorithm / Contributions
-clipped surrogate objective - limits how much action probabilities change via ratio clipping
-uses the probability ratio to compare new vs. old policy
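A small sketch of the clipped surrogate objective, averaged over a batch of (ratio, advantage) pairs. The function name is made up, and `epsilon=0.2` is an assumed default for illustration.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch.

    This is the quantity to maximize; negate it to get a loss for a minimizer.
    """
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the min keeps the more pessimistic of the two estimates.
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)

# Good action, ratio already above 1 + eps: the gain is capped.
capped_gain = clipped_surrogate([1.5], [1.0])
# Bad action, ratio already below 1 - eps: pushing it lower earns nothing more.
capped_loss = clipped_surrogate([0.5], [-1.0])
# Good action whose probability dropped: no clipping, the full penalty is felt.
unclipped = clipped_surrogate([0.5], [1.0])
```

Once the ratio leaves the interval [1 - eps, 1 + eps] in the direction the advantage favors, the objective goes flat. There is then no gradient pushing the policy further that way, which is what allows several updates on the same batch.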
TLDR:
PPO = stable policy gradient via clipped updates that prevent large policy shifts
-value function is learned by regression (MSE) against return targets computed from the collected rollouts (supervised)
-advantage = observed outcome - expected outcome
-deviation from expectation
-this is how we mathematically push the model towards the better answers (positive advantage) and away from the worse ones (negative advantage)
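A tiny sketch of the last few bullets, assuming the simplest form of the advantage (observed return minus the value prediction) and using the observed return as the regression target for the value function. The helper names and the numbers are hypothetical.

```python
def simple_advantages(returns, value_predictions):
    """advantage = observed outcome - expected outcome, per sample."""
    return [g - v for g, v in zip(returns, value_predictions)]

def value_mse(value_predictions, targets):
    """Mean squared error between value predictions and their targets."""
    return sum((v - y) ** 2 for v, y in zip(value_predictions, targets)) / len(targets)

# Hypothetical batch: the value function expected 2.0 in both cases.
returns = [3.0, 1.0]
predicted = [2.0, 2.0]

advantages = simple_advantages(returns, predicted)
loss = value_mse(predicted, returns)
```

The first sample beat expectations and gets a positive advantage, so its action is pushed up. The second fell short and gets a negative advantage, so its action is pushed down.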