One-Line Summary: Adapting Proximal Policy Optimization from game environments to text generation -- where actions are tokens, episodes are sequences, rewards arrive only at the end, and four full-sized neural networks must coexist in GPU memory.

Prerequisites: Proximal Policy Optimization (04-policy-gradient-methods/proximal-policy-optimization.md), advantage estimation (04-policy-gradient-methods/advantage-estimation.md), actor-critic methods (04-policy-gradient-methods/actor-critic-methods.md), KL divergence, language model basics (autoregressive generation).

What Is PPO for Language Models?

PPO was designed for problems like teaching a robot to walk -- the agent takes a step, observes the ground, decides on the next movement, and gets continuous feedback about balance and forward progress. Now imagine instead teaching someone to write a poem: they must choose one word at a time from a vocabulary of 50,000+ options, they receive no feedback until the entire poem is finished, and "good" is defined not by physics but by subjective human taste. That is the challenge of applying PPO to language models.

In the language model setting, PPO optimizes an autoregressive policy (the LLM) that generates text token by token. Each token selection is an action, the growing sequence is the state, and the reward comes only after the full response is generated. This creates a credit assignment problem far more extreme than typical RL: a 500-token response means 500 sequential decisions, but only one reward signal at the end.

How It Works

Text Generation as a Markov Decision Process

The text generation process is cast as an MDP:

  • State $s_t$: The prompt $x$ concatenated with all tokens generated so far, $s_t = (x, y_{<t})$
  • Action $a_t$: The next token $y_t$ selected from the vocabulary $\mathcal{V}$ (typically $|\mathcal{V}| \geq 50{,}000$)
  • Transition: Deterministic -- appending the chosen token to the sequence
  • Reward: $r_t = 0$ for $t < T$, and $r_T = r_\phi(x, y)$ at the final token
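
To make the mapping concrete, here is a minimal sketch of one environment step under this MDP in PyTorch. The function name and shapes are illustrative, not any particular library's API:

```python
import torch

def mdp_step(state: torch.Tensor, action_token: int, t: int, T: int,
             terminal_reward: float) -> tuple[torch.Tensor, float]:
    """One step of the text-generation MDP.

    state: 1-D tensor of token ids (prompt plus tokens generated so far).
    The transition is deterministic -- append the chosen token. The reward
    is zero everywhere except the final step, where the reward model's
    score for the completed sequence is delivered.
    """
    next_state = torch.cat([state, torch.tensor([action_token])])
    reward = terminal_reward if t == T - 1 else 0.0
    return next_state, reward

state = torch.tensor([101, 2023, 3231])   # prompt token ids (illustrative)
state, r = mdp_step(state, action_token=42, t=0, T=5, terminal_reward=1.3)
```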

The KL penalty against the reference policy can also be distributed across timesteps as a per-token term, rather than applied once at the end:

$$r_t = -\beta \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)} + \mathbb{1}[t = T]\, r_\phi(x, y)$$
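
A minimal implementation of this shaped reward, assuming per-token log-probabilities have already been gathered from both models (tensor shapes and the default $\beta$ are illustrative):

```python
import torch

def shaped_rewards(policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   reward_model_score: float,
                   beta: float = 0.1) -> torch.Tensor:
    """Per-token reward: -beta * (log-ratio KL estimate) at every step,
    plus the reward model's scalar score at the final token only.

    policy_logprobs, ref_logprobs: shape (T,), the log-probabilities of
    the tokens actually generated, under the active and reference policies.
    """
    kl_per_token = policy_logprobs - ref_logprobs   # log-ratio estimator of KL
    rewards = -beta * kl_per_token                  # dense KL penalty
    rewards[-1] += reward_model_score               # sparse terminal reward
    return rewards
```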

The PPO Objective for LLMs

The clipped surrogate objective from standard PPO applies directly:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \text{clip}\!\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the importance sampling ratio and $\hat{A}_t$ is the advantage estimated using GAE:

$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\psi(s_{t+1}) - V_\psi(s_t)$$
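
Both pieces are short to implement. A minimal sketch, assuming per-token rewards, value estimates, and log-probabilities have already been collected for one response (names and shapes are illustrative):

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over one response of length T.
    rewards, values: shape (T,); the value after the final token is 0."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def clipped_policy_loss(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate objective, negated for gradient descent."""
    ratio = torch.exp(logprobs - old_logprobs)            # rho_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```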

The Four-Model Architecture

PPO for LLMs requires four models in GPU memory simultaneously:

  1. Active policy $\pi_\theta$: The LLM being optimized (generates responses, gets updated)
  2. Reference policy $\pi_{\text{ref}}$: A frozen copy of the initial SFT policy (computes the KL penalty, never updated)
  3. Reward model $r_\phi$: Predicts human preference scores (frozen during PPO)
  4. Value function $V_\psi$: Estimates expected future return for advantage computation (trained alongside the policy)

For a 7B-parameter policy, this means roughly 4 × 7B = 28B parameters in memory (the value head is typically a small addition to a copy of the policy network). This massive memory footprint is a primary engineering constraint and a motivator for alternatives like DPO and GRPO.
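
Back-of-the-envelope arithmetic makes the constraint concrete. The sketch below assumes bf16 weights and fp32 Adam state for the two trained models; exact numbers vary with sharding, activation memory, and KV caches:

```python
def rlhf_memory_gb(n_params: float = 7e9) -> float:
    """Rough weight + optimizer memory for four same-sized models."""
    bf16, fp32 = 2, 4                      # bytes per parameter
    weights = 4 * n_params * bf16          # policy, reference, RM, value
    # Adam holds fp32 master weights plus two moment buffers for each of
    # the two trained models (policy and value function):
    optimizer = 2 * n_params * 3 * fp32
    return (weights + optimizer) / 1e9

print(f"~{rlhf_memory_gb():.0f} GB before activations and KV cache")
# ~224 GB: 56 GB of weights plus 168 GB of optimizer state
```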

The Generation-Training Loop

Unlike standard RL where environment steps are cheap, each "environment step" in LLM PPO involves full autoregressive generation:

  1. Sample prompts from dataset
  2. Generate responses from $\pi_\theta$ (expensive: sequential token generation)
  3. Score responses with $r_\phi$ (single forward pass per response)
  4. Compute KL penalties using $\pi_{\text{ref}}$ (forward pass per response)
  5. Estimate advantages using $V_\psi$ and GAE
  6. Update $\pi_\theta$ and $V_\psi$ with multiple PPO epochs on the collected batch

Steps 2--6 repeat for each batch. Generation (step 2) often dominates wall-clock time.
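
The same loop as a structural sketch in Python, reusing the helper functions above. Every object and method here (`policy.generate`, `reward_model.score`, and so on) is a placeholder for the machinery described in this file, not a specific library's API; batching, padding, and micro-batching are elided:

```python
def ppo_iteration(policy, ref_policy, reward_model, value_fn, prompts,
                  ppo_epochs: int = 4, beta: float = 0.1) -> None:
    """One illustrative PPO iteration over a batch of prompts."""
    # Steps 1-2: rollout via autoregressive generation (dominates wall-clock time)
    responses, old_logprobs = policy.generate(prompts)

    # Step 3: sparse terminal rewards from the frozen reward model
    scores = reward_model.score(prompts, responses)

    # Step 4: per-token KL penalty against the frozen reference policy
    ref_logprobs = ref_policy.logprobs(prompts, responses)
    rewards = shaped_rewards(old_logprobs, ref_logprobs, scores, beta)

    # Step 5: advantages from the value function and GAE
    values = value_fn.predict(prompts, responses)
    advantages = gae_advantages(rewards, values)

    # Step 6: several optimization epochs on this fixed batch of rollouts
    for _ in range(ppo_epochs):
        logprobs = policy.logprobs(prompts, responses)
        loss = clipped_policy_loss(logprobs, old_logprobs, advantages)
        loss = loss + value_fn.regression_loss(rewards, values)
        loss.backward()
        policy.step()
        value_fn.step()
```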

Why It Matters

PPO for language models is the optimization engine behind ChatGPT, Claude, and most commercial LLM assistants. Despite its complexity, it remains the most well-validated approach for converting human preference signals into policy improvements. The specific adaptations -- distributing KL penalties, handling sparse terminal rewards, managing four-model memory -- represent hard-won engineering knowledge that determines whether RLHF succeeds or produces degenerate outputs.

Key Technical Details

  • Clip parameter: $\epsilon = 0.2$ is standard, sometimes reduced to $\epsilon = 0.1$ for LLMs to enforce more conservative updates.
  • Learning rate: Typically $10^{-6}$ to $10^{-5}$, an order of magnitude lower than SFT learning rates.
  • Batch size: 64--512 prompts per batch, with each prompt generating a response of 256--2048 tokens.
  • PPO epochs: 1--4 optimization epochs per batch of generated data. More epochs risk overfitting to stale data.
  • GAE parameters: $\gamma = 1.0$ (no discounting within a response) and $\lambda = 0.95$ are common starting points.
  • KL coefficient: $\beta \approx 0.01$--$0.2$. Some implementations use adaptive KL targeting, increasing $\beta$ when the measured KL exceeds a target and decreasing it when below (see the controller sketch after this list).
  • Value function initialization: $V_\psi$ is typically initialized from the SFT model with a scalar output head. Poor value function initialization causes large advantage estimation errors early in training.
  • Gradient accumulation: With large models, effective batch sizes are achieved through gradient accumulation across micro-batches.
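
A minimal sketch of such an adaptive controller, in the spirit of the proportional scheme from Ziegler et al. (2019); the default target, step size, and clip range are illustrative:

```python
class AdaptiveKLController:
    """Nudges the KL coefficient beta toward a target per-response KL."""

    def __init__(self, init_beta: float = 0.1, target_kl: float = 6.0,
                 k_beta: float = 0.1):
        self.beta = init_beta
        self.target_kl = target_kl
        self.k_beta = k_beta

    def update(self, observed_kl: float) -> float:
        # Proportional error, clipped so each adjustment stays small
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + self.k_beta * error
        return self.beta

ctl = AdaptiveKLController()
beta = ctl.update(observed_kl=9.0)   # KL above target, so beta increases
```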

Common Misconceptions

  • "PPO for LLMs is the same algorithm as PPO for Atari." The algorithm is mathematically identical, but the engineering is radically different. The action space (50K+ discrete tokens vs. ~18 Atari actions), episode length (hundreds of tokens vs. thousands of frames), reward structure (terminal only vs. dense), and memory requirements (four LLMs vs. small CNNs) create fundamentally different challenges.
  • "The KL penalty is just regularization." It is regularization, but it also prevents a specific failure mode: without it, the policy rapidly finds adversarial inputs to the reward model -- sequences that score highly but are nonsensical or degenerate. The KL penalty is as much about reward model robustness as policy stability.
  • "More PPO training always improves quality." There is a characteristic "alignment tax" curve: quality improves rapidly, plateaus, and then degrades as the model begins to overoptimize against the reward model. Knowing when to stop is critical.
  • "The value function is optional." Without , you must use REINFORCE-style Monte Carlo returns, which have much higher variance. The value function baseline is essential for stable training, though alternatives like GRPO eliminate it through different means.

Connections to Other Concepts

  • rlhf-pipeline.md -- PPO is Stage 3 of the RLHF pipeline; this file details the specific mechanics.
  • reward-modeling-for-llms.md -- The reward model that provides the training signal for PPO.
  • grpo.md -- An alternative that eliminates the value function entirely through group-relative advantages.
  • dpo-as-implicit-rl.md -- Eliminates the PPO loop entirely by deriving the optimal policy in closed form.
  • 04-policy-gradient-methods/proximal-policy-optimization.md -- The base PPO algorithm before LLM-specific adaptations.
  • 04-policy-gradient-methods/advantage-estimation.md -- GAE, the advantage estimation method used within PPO for LLMs.
  • 04-policy-gradient-methods/entropy-regularization.md -- The KL penalty in RLHF is conceptually related to entropy regularization in standard RL.

Further Reading

  • Ziegler et al. (2019), "Fine-Tuning Language Models from Human Preferences" -- The first application of PPO to language model fine-tuning, establishing the core four-model architecture and KL penalty approach.
  • Ouyang et al. (2022), "Training Language Models to Follow Instructions with Human Feedback" -- InstructGPT: the most detailed public description of PPO applied to large-scale LLM alignment.
  • Zheng et al. (2023), "Secrets of RLHF in Large Language Models Part I: PPO" -- Practical analysis of PPO hyperparameters, training stability, and implementation details for LLMs.
  • Huang et al. (2024), "The N+ Implementation Details of RLHF with PPO" -- An exhaustive catalog of implementation details that determine success or failure of PPO for LLMs.
  • Schulman et al. (2017), "Proximal Policy Optimization Algorithms" -- The original PPO paper; essential context for understanding the base algorithm before LLM adaptations.