Proximal Policy Optimization Algorithms (Schulman et al.)

May 3, 2026

policy gradient methods - compute an estimator of the policy gradient and plug it into a stochastic gradient ascent (SGA) algorithm

-common estimator: $\hat{g} = \hat{\mathbb{E}}_t [\nabla_\theta \log \pi_\theta (a_t | s_t) \hat{A}_t]$
-$\nabla_\theta \log \pi_\theta (a_t | s_t)$ - "which parameter update would increase the probability of taking action $a_t$ in state $s_t$"
-advantage $\hat{A}_t$ - how much better a certain action is relative to a baseline
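
A minimal sketch of the estimator above, under simplifying assumptions that are mine and not the paper's: the policy is a softmax over three discrete actions, it is parameterized directly by its logits, and it ignores the state. In that setting the gradient of $\log \pi_\theta(a)$ with respect to the logits is one-hot(a) minus the action probabilities. All function names and sample values are hypothetical.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    # d/d(logits) of log softmax(logits)[action] = onehot(action) - probs
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def policy_gradient_estimate(logits, actions, advantages):
    # Empirical mean over timesteps of grad log pi(a_t) * A_t.
    n = len(actions)
    g = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        glp = grad_log_prob(logits, a)
        for i in range(len(g)):
            g[i] += glp[i] * adv / n
    return g

# Made-up batch: action 0 had positive advantages, action 1 a negative one.
logits = [0.0, 0.0, 0.0]
actions = [0, 1, 2, 0]
advantages = [1.0, -1.0, 0.0, 2.0]
g_hat = policy_gradient_estimate(logits, actions, advantages)
```

With this batch the estimate is positive for action 0 and negative for action 1, so a gradient ascent step raises the logit of the action that did better than the baseline.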

probability ratio - how likely the new policy is to take action $a_t$ compared to the old policy

$$r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\text{old}}(a_t | s_t)}$$
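
In practice the ratio is usually computed from log-probabilities, since the difference of logs is more stable numerically than dividing two small probabilities. A small sketch; the function name and the sample probabilities are made up for illustration:

```python
import math

def probability_ratio(logp_new, logp_old):
    # r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t), computed in log space.
    return math.exp(logp_new - logp_old)

# New policy assigns 0.3 to the sampled action, old policy assigned 0.2.
r = probability_ratio(math.log(0.3), math.log(0.2))  # approximately 1.5

# If the policy has not changed yet, the ratio is exactly 1.
r_same = probability_ratio(math.log(0.2), math.log(0.2))
```

A ratio above 1 means the new policy makes the action more likely than the old one did, and a ratio below 1 means less likely.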

Motivation

-vanilla policy gradients are unstable and sensitive to step size
-TRPO fixes this with "trust regions" but is complex and expensive
-need something simple, stable, and sample-efficient (can reuse data)

Key Considerations

-don't move policy too far per step
-multiple gradient updates per batch
-stable advantage estimates (GAE)
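
A rough sketch of generalized advantage estimation (GAE) for one rollout segment. It assumes no episode ends inside the segment, and that `last_value` is the value estimate for the state after the final step. The function name, argument names, and default `gamma` / `lam` values are illustrative choices, not taken from these notes.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Walk backwards through the segment, accumulating discounted TD residuals:
    #   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    #   A_t     = delta_t + gamma * lam * A_{t+1}
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

# lam = 1 sums all future TD residuals (low bias, high variance).
adv_full = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0)

# lam = 0 keeps only the one-step TD residual (higher bias, lower variance).
adv_td = compute_gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], 0.5, gamma=1.0, lam=0.0)
```

The `lam` parameter trades bias against variance, which is what makes the advantage estimates more stable than raw returns.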

Core Algorithm / Contributions

-clipped surrogate objective - limits how much action probabilities change via ratio clipping
$$\min(r_t \hat{A}_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) \hat{A}_t)$$
-uses $r_t$ to compare new vs. old policy
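
A small sketch of the clipped term, averaged over a batch. The function names and the default `eps=0.2` are assumptions for illustration. The four sample pairs cover each combination of ratio above or below the clip range with positive or negative advantage.

```python
def clip(x, lo, hi):
    return max(lo, min(x, hi))

def clipped_surrogate(ratios, advantages, eps=0.2):
    # Per-timestep term: min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t),
    # then the mean over the batch. This is the quantity to maximize.
    terms = [
        min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)

# ratio too high, A > 0: capped at 1.2 * A, no extra reward for moving further
up_good = clipped_surrogate([1.5], [1.0])
# ratio too low, A > 0: unclipped term 0.5 * A is smaller, so it is kept
down_good = clipped_surrogate([0.5], [1.0])
# ratio too high, A < 0: unclipped term -1.5 is smaller, so the penalty is kept
up_bad = clipped_surrogate([1.5], [-1.0])
# ratio too low, A < 0: capped at 0.8 * A, no extra reward for moving further
down_bad = clipped_surrogate([0.5], [-1.0])
```

Taking the minimum makes the objective a pessimistic bound: clipping only removes the incentive to move the ratio further in the direction that already improves the objective, and never hides a change that makes things worse.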

TLDR:

PPO = stable policy gradient via clipped updates that prevent large policy shifts

-value function $V(s)$ is learned by supervised regression: minimize the MSE between its predictions and return targets computed from previously collected rollouts
-advantage = observed outcome - expected outcome
-deviation from expectation
-this is how we mathematically push the model towards the better answers (positive $\hat{A}_t$) and away from the worse ones (negative $\hat{A}_t$)
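
A worked sketch of "observed minus expected", using the simplest possible choice: the observed outcome is the discounted return from a rollout and the expected outcome is the value prediction. This is a plain Monte Carlo advantage rather than GAE, and the function names and numbers are made up for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over one episode.
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages_and_value_loss(rewards, values, gamma=0.99):
    returns = discounted_returns(rewards, gamma)
    # Advantage: observed outcome (return) minus expected outcome (V(s_t)).
    advantages = [g - v for g, v in zip(returns, values)]
    # Value loss: MSE between predictions and the return targets.
    value_loss = sum((v - g) ** 2 for g, v in zip(returns, values)) / len(returns)
    return advantages, value_loss

rewards = [1.0, 0.0, 1.0]
values = [1.0, 1.0, 2.0]  # hypothetical critic predictions
advs, v_loss = advantages_and_value_loss(rewards, values, gamma=1.0)
```

Here the returns are 2, 1, 1, so the first step did better than expected (advantage +1), the second matched expectation (0), and the third did worse (-1).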