Proximal Policy Optimization Algorithms (Schulman et al.)

May 3, 2026

policy gradient methods - work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent (SGA) algorithm

-common estimator:

\hat{g} = \hat{\mathbb{E}}_t [\nabla_\theta \log \pi_\theta (a_t | s_t) \hat{A}_t]

\nabla_\theta \log \pi_\theta (a_t | s_t)

- "what update to $\theta$ would increase the probability of taking action $a_t$ in state $s_t$ under the policy"

-advantage $\hat{A}_t$ - how much better a certain action is relative to a baseline

probability ratio - how likely is the new policy to take action $a_t$ compared to the old policy

r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}
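
A minimal sketch of the ratio, assuming we already have the log-probabilities of the sampled actions under both policies (the function name and the example probabilities are made up for illustration). Working in log space and exponentiating the difference is a common way to avoid dividing tiny probabilities:

```python
import numpy as np

def probability_ratio(logp_new, logp_old):
    """r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t), computed from log-probs."""
    return np.exp(np.asarray(logp_new) - np.asarray(logp_old))

# Hypothetical probabilities of the sampled actions under each policy.
p_old = np.array([0.5, 0.2, 0.1])
p_new = np.array([0.5, 0.3, 0.05])

ratio = probability_ratio(np.log(p_new), np.log(p_old))
# ratio > 1: new policy is more likely to take the action; ratio < 1: less likely
```

Here the ratios come out to roughly 1.0, 1.5, and 0.5: unchanged, more likely, and less likely under the new policy.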

-vanilla policy gradients are unstable and sensitive to step size

-TRPO fixes this with "trust regions" but is complex and expensive

-need something simple, stable, and sample-efficient (can reuse data)

-don't move policy too far per step

-multiple gradient updates per batch

-stable advantage estimates (GAE)

-clipped surrogate objective - limits how much action probabilities change via ratio clipping

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t [\min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t)]

-uses $r_t$ to compare new vs. old policy
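
A small sketch of the clipped objective in NumPy, to see what the min and clip actually do per timestep. The function name, the example ratios and advantages, and the default `epsilon=0.2` are assumptions for illustration:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-timestep min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the min keeps the more pessimistic of the two terms.
    return np.minimum(unclipped, clipped)

# Hypothetical ratios and advantages covering the four sign/direction cases.
ratios = np.array([1.5, 0.5, 1.5, 0.5])
advantages = np.array([1.0, 1.0, -1.0, -1.0])

per_step = clipped_surrogate(ratios, advantages)
objective = per_step.mean()  # empirical average over timesteps, to be maximized
```

With these inputs the per-step values are 1.2, 0.5, -1.5, -0.8. The clip only bites when the ratio has already moved in the direction the advantage rewards (cases 1 and 4), so there is no extra gain from pushing further. When the ratio moved the wrong way (cases 2 and 3), the unclipped term is kept and the objective still penalizes it.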

TLDR:

PPO = stable policy gradient via clipped updates that prevent large policy shifts

-value function $V(s)$ is learned by MSE regression onto return targets computed from the collected rollouts (supervised)
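
A rough sketch of that value regression, assuming the targets are plain discounted returns from one rollout (other target choices exist). The helper names, `gamma`, and the example numbers are made up for illustration:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return-to-go at each timestep: R_t = r_t + gamma * R_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_loss(values, targets):
    """Mean squared error between predicted values and return targets."""
    values = np.asarray(values, dtype=float)
    return np.mean((values - targets) ** 2)

# Hypothetical 3-step rollout and value predictions.
rewards = [1.0, 0.0, 2.0]
targets = discounted_returns(rewards, gamma=0.5)
loss = value_loss([1.0, 1.0, 1.0], targets)
```

The targets here are 1.5, 1.0, 2.0, so the loss is (0.25 + 0 + 1) / 3. In training, the targets are treated as fixed labels and only the value predictions receive gradients.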

-advantage = observed outcome - expected outcome

-deviation from expectation

-this is how we mathematically push the model towards the better answers (positive $\hat{A}_t$) and away from the worse ones (negative $\hat{A}_t$)
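
A sketch of how "observed - expected" can be computed with GAE, assuming a single rollout that does not terminate partway through (no done-flags handled). The function name, argument names, and default `gamma` / `lam` values are assumptions for illustration:

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout.

    values[t] is V(s_t); last_value is V of the state after the final step.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: observed (r_t + gamma * V(s_{t+1})) minus expected V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 2-step rollout.
adv_mc = gae([1.0, 1.0], [0.5, 0.5], last_value=0.0, gamma=1.0, lam=1.0)
adv_td = gae([1.0, 1.0], [0.5, 0.5], last_value=0.0, gamma=1.0, lam=0.0)
```

With `lam=1` this reduces to return minus value (1.5 and 0.5 here). With `lam=0` it is just the one-step TD error (1.0 and 0.5). Values in between trade variance against bias.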