One-Line Summary: A clipped surrogate objective that approximates trust region constraints using only first-order optimization -- the dominant algorithm in modern reinforcement learning and the engine behind RLHF for large language models.
Prerequisites: Trust Region Methods (trust-region-methods.md), advantage estimation (advantage-estimation.md), actor-critic methods (actor-critic-methods.md), policy gradient theorem (policy-gradient-theorem.md), importance sampling basics.
What Is PPO?
Imagine TRPO is a careful surgeon performing a delicate operation with specialized instruments, constant monitoring, and precise measurements at every step. PPO is the experienced field medic who achieves nearly the same outcomes using simple, robust techniques that work reliably under any conditions. Both prevent catastrophic harm, but PPO does it with a fraction of the complexity.
PPO, introduced by Schulman et al. (2017), replaces TRPO's computationally expensive constrained optimization (conjugate gradients, line search, Fisher-vector products) with a simple clipped objective function that can be optimized with standard stochastic gradient descent. This simplicity, combined with strong empirical performance, has made PPO the default choice for virtually every major RL application -- from OpenAI Five (Dota 2) to InstructGPT and ChatGPT (RLHF), to robotic control and game playing.
How It Works
The Probability Ratio
Define the probability ratio between the new and old policy for taking action $a_t$ in state $s_t$:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

When $\theta = \theta_{\text{old}}$, this ratio equals 1. The standard surrogate objective from TRPO is:

$$L(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\, \hat{A}_t \right]$$

Without any constraint, maximizing $L(\theta)$ can lead to destructively large policy updates when $r_t(\theta)$ deviates far from 1.
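In practice the ratio is usually computed from stored log-probabilities rather than raw probabilities, which is numerically more stable. A tiny illustrative sketch (the numbers are made up):

```python
import torch

# Log-probabilities of the sampled actions under the old policy (stored at
# collection time) and under the current policy (recomputed at update time).
old_log_probs = torch.tensor([-1.20, -0.35, -2.10])
new_log_probs = torch.tensor([-1.20, -0.10, -2.80])

# r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space.
ratio = torch.exp(new_log_probs - old_log_probs)
print(ratio)  # first entry is 1.0 because the two policies agree on that action
```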
The PPO-Clip Objective
PPO constrains the ratio by clipping it:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right) \right]$$

where $\epsilon$ is the clipping parameter (default: 0.2). The $\min$ operation selects the more pessimistic (conservative) estimate of the improvement. Here is how it works for the two cases:
When $\hat{A}_t > 0$ (the action was better than average):
- We want to increase $\pi_\theta(a_t \mid s_t)$, making $r_t(\theta) > 1$.
- But $r_t(\theta)$ is clipped at $1+\epsilon$, so the objective flattens beyond that point.
- The gradient becomes zero for $r_t(\theta) > 1+\epsilon$: no further incentive to increase the probability.
When $\hat{A}_t < 0$ (the action was worse than average):
- We want to decrease $\pi_\theta(a_t \mid s_t)$, making $r_t(\theta) < 1$.
- But $r_t(\theta)$ is clipped at $1-\epsilon$, so the objective flattens below that point.
- The gradient becomes zero for $r_t(\theta) < 1-\epsilon$: no further incentive to decrease the probability.
The clipping creates a "trust region" in probability ratio space: the policy can change, but only within the band $[1-\epsilon,\ 1+\epsilon]$ per update. This is a first-order approximation to TRPO's KL divergence constraint.
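A minimal PyTorch sketch of this clipped objective (variable names are illustrative; the surrogate is negated so it can be minimized with a standard optimizer):

```python
import torch

def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss L^CLIP, negated for gradient descent."""
    ratio = torch.exp(new_log_probs - old_log_probs)                    # r_t(theta)
    unclipped = ratio * advantages                                       # r_t * A_t
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # min(...) keeps the more pessimistic estimate; gradients vanish in the clipped region.
    return -torch.min(unclipped, clipped).mean()
```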
The Full PPO Algorithm
1. Collect a batch of trajectories using the current policy $\pi_{\theta_{\text{old}}}$.
2. Compute advantages $\hat{A}_t$ using GAE (advantage-estimation.md).
3. For $K$ epochs (typically 3-10) over the collected batch:
   - Compute the ratios $r_t(\theta)$ and the clipped surrogate loss $L^{\text{CLIP}}(\theta)$.
   - Compute the value function loss: $L^{VF}(\theta) = \big(V_\theta(s_t) - \hat{R}_t\big)^2$.
   - Compute the entropy bonus: $S[\pi_\theta](s_t)$.
   - Update $\theta$ to maximize $L^{\text{CLIP}}(\theta) - c_1 L^{VF}(\theta) + c_2 S[\pi_\theta](s_t)$.
4. Set $\theta_{\text{old}} \leftarrow \theta$ and return to step 1.
The key innovation is step 3: multiple gradient steps on the same batch. The clipping prevents the policy from moving too far from , so reusing the batch is safe. This dramatically improves sample efficiency compared to single-update methods like A2C.
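A condensed PyTorch sketch of one such iteration, assuming a combined policy/value network `agent` that exposes an `evaluate(states, actions)` method returning log-probabilities, value estimates, and entropies (this interface is an assumption for illustration, not a fixed API):

```python
import torch

def ppo_update(agent, optimizer, batch, epochs=4, clip_eps=0.2,
               vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5):
    """Run several epochs of clipped-surrogate updates on one collected batch."""
    states, actions, old_log_probs, returns, advantages = batch
    # Normalize advantages (zero mean, unit std) for stable training.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    for _ in range(epochs):                        # step 3: reuse the same batch
        log_probs, values, entropy = agent.evaluate(states, actions)
        ratio = torch.exp(log_probs - old_log_probs)

        policy_loss = -torch.min(
            ratio * advantages,
            torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
        ).mean()
        value_loss = (values - returns).pow(2).mean()

        # Minimize: -L^CLIP + c1 * L^VF - c2 * entropy bonus.
        loss = policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(agent.parameters(), max_grad_norm)
        optimizer.step()
```

A full implementation would additionally split the batch into minibatches within each epoch, as noted in the technical details below.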
PPO vs. TRPO: Simplicity Wins
| Aspect | TRPO | PPO |
|---|---|---|
| Constraint mechanism | KL divergence (exact) | Probability ratio clipping |
| Optimization | Conjugate gradient + line search | Standard SGD/Adam |
| Implementation | ~500 lines of specialized code | ~50 lines beyond A2C |
| Second-order info | Fisher-vector products | None needed |
| Multiple epochs per batch | No (constraint violations) | Yes (clipping protects) |
| Wall-clock speed | Slower per update | 3-5x faster per update |
| Performance | Strong | Equal or better |
PPO for RLHF (Language Models)
In Reinforcement Learning from Human Feedback, PPO is used to fine-tune a language model $\pi_\theta$ to maximize a learned reward model $r_\phi$ while staying close to a reference policy $\pi_{\text{ref}}$. The RLHF objective is:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\Big[ r_\phi(x, y) - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big) \Big]$$
PPO optimizes this by treating token generation as a sequential decision process. Each token is an action; the reward model's score is assigned at the end of the sequence. The KL penalty keeps the model from drifting too far from the pretrained policy and from exploiting flaws in the reward model (reward hacking). PPO's stability under this setup is why it became the standard for aligning language models with human preferences.
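One common implementation pattern is to fold the per-token KL penalty into the reward stream that PPO optimizes; a sketch of that reward shaping under illustrative names and coefficients (not a fixed recipe from any single paper):

```python
import torch

def rlhf_token_rewards(policy_log_probs: torch.Tensor,   # log pi_theta(y_t | x, y_<t)
                       ref_log_probs: torch.Tensor,       # log pi_ref(y_t | x, y_<t)
                       sequence_reward: float,            # scalar score from the reward model
                       kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token rewards: a -beta * KL estimate at every token, plus the reward
    model's score added to the final token of the sequence."""
    rewards = -kl_coef * (policy_log_probs - ref_log_probs)
    rewards[-1] = rewards[-1] + sequence_reward
    return rewards
```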
Why It Matters
PPO is arguably the single most important RL algorithm of the past decade. It powers:
- OpenAI Five: Dota 2 at superhuman level (2018-2019).
- InstructGPT and ChatGPT: RLHF alignment of GPT models (2022-2023).
- Claude, Gemini: RLHF and RLAIF pipelines for language model alignment.
- OpenAI Gym / MuJoCo benchmarks: Default baseline for continuous control research.
- Robotic control: Dexterous manipulation policies via sim-to-real transfer.
Its combination of simplicity, stability, and strong performance makes it the first algorithm most practitioners reach for. When a new RL problem arises, PPO is typically the default starting point.
Key Technical Details
- Clipping parameter $\epsilon$: 0.2 is the near-universal default. Smaller values (0.1) are more conservative; larger values (0.3) allow faster but riskier updates.
- Number of epochs $K$: 3-10 per batch. More epochs extract more from each batch but risk overfitting to it. PPO for RLHF typically uses small values, around $K$ = 1-4.
- Minibatch size: The batch is usually split into minibatches (e.g., 32-512 transitions) for stochastic optimization within each epoch.
- GAE parameters: $\gamma = 0.99$, $\lambda = 0.95$ are standard defaults.
- Value function clipping: Some implementations clip the value function update similarly to the policy, preventing large changes: $V_{\text{clip}} = V_{\theta_{\text{old}}} + \operatorname{clip}\big(V_\theta - V_{\theta_{\text{old}}},\, -\epsilon,\, \epsilon\big)$ (see the sketch after this list).
- Learning rate: Typically $2.5 \times 10^{-4}$ for Atari and $3 \times 10^{-4}$ for MuJoCo, often with linear decay to zero over training. RLHF uses roughly $10^{-6}$ to $10^{-5}$.
- Advantage normalization: Advantages are normalized (zero mean, unit standard deviation) per minibatch. This is critical for stable training.
- Gradient clipping: Max gradient norm of 0.5 is standard to prevent spikes.
- The original paper also proposes PPO-Penalty (using adaptive KL penalty instead of clipping), but PPO-Clip dominates in practice.
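As a companion to the value-clipping detail above, a sketch of the clipped value loss (same illustrative conventions as the earlier snippets):

```python
import torch

def clipped_value_loss(values: torch.Tensor,       # V_theta(s_t), current network
                       old_values: torch.Tensor,   # V_theta_old(s_t), stored at collection
                       returns: torch.Tensor,
                       clip_eps: float = 0.2) -> torch.Tensor:
    """Clip the value update around the old value estimate, mirroring the policy clip."""
    values_clipped = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    loss_unclipped = (values - returns).pow(2)
    loss_clipped = (values_clipped - returns).pow(2)
    # Take the more pessimistic (larger) of the two losses per transition.
    return torch.max(loss_unclipped, loss_clipped).mean()
```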
Common Misconceptions
- "PPO clips the gradient." PPO clips the objective (the probability ratio), not the gradient. The gradient flows normally through the unclipped region and is zero in the clipped region. Gradient clipping (by max norm) is a separate, standard technique used alongside PPO.
- "PPO is an on-policy algorithm." It is quasi-on-policy. Within a single iteration, PPO reuses the same batch across epochs, which is a mild form of off-policy learning. The clipping ensures this reuse is safe.
- "PPO guarantees monotonic improvement." Unlike TRPO, PPO has no formal theoretical guarantee. The clipping is a heuristic approximation of the trust region. In practice, PPO can still occasionally degrade performance, though far less frequently than unconstrained methods.
- "PPO-Clip and PPO-Penalty are equivalent." They produce different behaviors. PPO-Clip is a hard constraint (objective is flat beyond the clip), while PPO-Penalty is a soft penalty (large changes are penalized but not prevented). Clip dominates practice.
- "A higher clip ratio means faster learning." Beyond , the "trust region" becomes too loose and training destabilizes. The optimal depends on the problem but 0.2 is robust across diverse settings.
Connections to Other Concepts
- trust-region-methods.md -- PPO approximates TRPO's KL constraint with a simpler clipping mechanism. Understanding TRPO theory clarifies why PPO works.
- advantage-estimation.md -- PPO uses GAE to compute $\hat{A}_t$. The quality of advantage estimates directly affects PPO's performance.
- actor-critic-methods.md -- PPO is an actor-critic method: the policy is the actor and the value function is the critic.
- entropy-regularization.md -- The entropy bonus in PPO's loss prevents premature convergence.
- a2c-and-a3c.md -- PPO can be viewed as A2C with a clipped surrogate objective and multiple epochs per batch.
- reinforce.md -- PPO descends from REINFORCE through the lineage: REINFORCE -> Actor-Critic -> A2C -> TRPO -> PPO.
Further Reading
- Schulman et al. (2017), "Proximal Policy Optimization Algorithms" -- The original PPO paper. Compares clipping and penalty variants across Atari, MuJoCo, and roboschool tasks, demonstrating that PPO-Clip matches or exceeds TRPO performance with dramatically simpler implementation.
- Zheng et al. (2023), "Secrets of RLHF in Large Language Models Part I: PPO" -- Detailed analysis of PPO implementation for RLHF, covering reward model design, KL penalty tuning, and practical tricks for stable language model training.
- Engstrom et al. (2020), "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO" -- Demonstrates that much of PPO's advantage over TRPO comes from code-level implementation details (reward normalization, advantage normalization, gradient clipping) rather than the clipped objective alone.
- Ouyang et al. (2022), "Training Language Models to Follow Instructions with Human Feedback" -- The InstructGPT paper, demonstrating PPO's central role in the RLHF pipeline that transformed modern language model alignment.