One-Line Summary: GRPO is a reinforcement learning algorithm developed by DeepSeek that eliminates the critic (value) model entirely by estimating advantages through group-based relative scoring of multiple sampled outputs -- dramatically reducing memory requirements while achieving stable, effective policy optimization.
Prerequisites: Understanding of RLHF and PPO (policy, reward, advantage estimation, clipped objectives), KL divergence as a regularizer, reward modeling, and basic statistics (mean, standard deviation, z-scores).
What Is GRPO?
Standard PPO in the RLHF pipeline requires four models in memory simultaneously: the policy, the reference model, the reward model, and a critic (value) model that estimates how good each state is. The critic is essential for computing "advantages" -- how much better an action was compared to what was expected. But training this critic is itself unstable, memory-intensive, and adds another source of error.
```mermaid
flowchart LR
    S1["group sampling"] --> S2["z-score advantage estimation"] --> S3["clipped policy update"]
```

GRPO asks: what if we could estimate advantages without a critic at all?
The key insight is surprisingly simple. Instead of training a neural network to predict expected rewards, GRPO samples a group of outputs for each prompt and uses the group's own statistics as the baseline. Think of it like grading on a curve: instead of having an external judge estimate what a "good" score should be, you simply compare each student's performance against the class average. If a response scored above the group mean, it gets a positive advantage; below the mean, negative. No external critic needed.
This approach draws from a long lineage in RL -- REINFORCE with baselines, self-play, and rejection sampling -- but packages it into a practical, scalable algorithm that proved powerful enough to train DeepSeek-R1, one of the first models to develop emergent chain-of-thought reasoning purely from reinforcement learning.
How It Works
```mermaid
flowchart LR
    subgraph L1["PPO (with critic)"]
        LI3["policy + critic + reward model + reference model"]
    end
    subgraph R2["GRPO (critic-free group-based advantage)"]
        RI4["policy + reward model + reference model"]
    end
```

Group-Based Advantage Estimation
For each prompt $q$, GRPO samples a group of $G$ outputs $\{o_1, o_2, \dots, o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$. Each output is scored by the reward model (or rule-based reward function), producing rewards $\{r_1, r_2, \dots, r_G\}$. The advantage for each output is computed as a z-score:

$$A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$
This is the core innovation. No learned value function, no critic network, no temporal difference learning. The group itself serves as the baseline.
A typical group size is $G = 64$, meaning 64 completions are sampled per prompt per training step. The z-score normalization ensures advantages are zero-mean and unit-variance within each group, which stabilizes gradient magnitudes across prompts that may have very different reward scales.

The variance of this estimator decreases as $1/G$. With $G = 64$, the standard error of the mean is about 12.5% ($1/\sqrt{64}$) of the standard deviation -- sufficiently precise for stable policy gradient updates.
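A minimal NumPy sketch of this group-based advantage computation (the group size and reward values below are illustrative):

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Z-score each reward against its own group's statistics.

    `eps` guards against division by zero when every reward in the
    group is identical -- in that case advantages collapse to ~0 and
    the group contributes no learning signal."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled completions for one prompt, scored 0/1 for correctness:
rewards = np.array([1.0, 0.0, 1.0, 1.0])
adv = group_advantages(rewards)
# Advantages are zero-mean and approximately unit-variance within the group;
# the lone incorrect completion receives a negative advantage.
```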
The GRPO Objective
Like PPO, GRPO uses a clipped surrogate objective to prevent overly large policy updates:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\ \operatorname{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\ 1-\epsilon,\ 1+\epsilon\right) A_i\right)\right) - \beta\, \mathbb{D}_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right]$$

The clipping ratio $\epsilon$ (typically 0.2) prevents the policy from changing too drastically in a single update. The KL divergence penalty against the reference policy, weighted by $\beta$, prevents the model from drifting too far from its starting point.
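The clipped part of the objective can be sketched in NumPy at the sequence level (a simplified view; real implementations work on per-token log-probabilities, and the ratio and advantage values here are illustrative):

```python
import numpy as np

def clipped_surrogate(ratio: np.ndarray, adv: np.ndarray, eps: float = 0.2) -> np.ndarray:
    """PPO-style pessimistic objective: take the minimum of the unclipped
    and clipped terms, so large ratio moves are never rewarded."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped)

# ratio = pi_theta(o|q) / pi_theta_old(o|q), one entry per sampled sequence
ratio = np.array([1.5, 0.5, 1.0])
adv = np.array([1.0, -1.0, 0.5])
obj = clipped_surrogate(ratio, adv)  # this is maximized; the loss is -obj.mean()
```

Note how the first entry is capped at $1.2 \cdot A_i$ even though the raw ratio is 1.5, and the second is pushed down to the pessimistic $0.8 \cdot A_i$: the objective never credits the policy for moving further than the clip range allows.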
Token-Level vs. Sequence-Level Advantages
In standard PPO for language models, the critic estimates a value at every token position, enabling per-token advantage computation via Generalized Advantage Estimation (GAE).
GRPO takes a different approach: it assigns a single advantage to the entire sequence and applies it uniformly to every token during the policy gradient update. While per-token advantages provide more fine-grained signal, they require a well-calibrated critic -- exactly what GRPO eliminates.
In practice, sequence-level advantages with large group sizes provide sufficient signal. The policy gradient still differentially upweights tokens in high-advantage sequences and downweights tokens in low-advantage sequences. The per-sequence signal gets refined across many training steps.
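Broadcasting one sequence-level advantage to every token of its sequence can be sketched as follows (the shapes and mask are illustrative; `mask` zeroes out padding positions):

```python
import numpy as np

G, T = 4, 6  # group size and max token length (illustrative)
seq_adv = np.array([0.58, -1.73, 0.58, 0.58])  # one advantage per sequence
mask = np.ones((G, T))                         # 1 for real tokens, 0 for padding
mask[1, 4:] = 0                                # e.g. sequence 1 has only 4 tokens

# Every token in a sequence shares that sequence's advantage:
token_adv = seq_adv[:, None] * mask            # shape (G, T)
```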
Reward Design for GRPO
The choice of reward function is particularly important because group-based advantage estimation requires meaningful variance in rewards across the group. If all outputs receive similar rewards, advantages will be near-zero and learning stalls.
DeepSeek-R1-Zero used carefully designed rule-based rewards:
- Correctness reward: Binary 1/0 for math (exact answer match), or partial credit based on solution structure
- Format compliance: Reward for placing answers within designated tags (e.g., <answer>...</answer>)
- Length penalty: Negative reward for excessively long or repetitive outputs
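A sketch of such a rule-based reward for a math task, assuming answers are wrapped in <answer> tags (the coefficients and length cutoff are illustrative, not DeepSeek's actual values):

```python
import re

def rule_based_reward(output: str, gold: str, max_chars: int = 2048) -> float:
    """Correctness + format compliance - length penalty (illustrative weights)."""
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    format_reward = 0.1 if match else 0.0                      # format compliance
    correct = 1.0 if match and match.group(1).strip() == gold.strip() else 0.0
    length_penalty = -0.1 if len(output) > max_chars else 0.0  # discourage rambling
    return correct + format_reward + length_penalty

rule_based_reward("Let me check my work... <answer>42</answer>", "42")  # 1.1
```

Because the correctness term is binary, a group of completions to the same prompt naturally produces the reward variance that z-score advantages need -- unless the policy gets every sample right (or wrong), in which case that group's advantages vanish.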
The Training Loop in Practice
A single GRPO training iteration proceeds as follows:
1. Sample a batch of prompts from the training set.
2. For each prompt, generate $G$ complete responses using the current policy.
3. Score each response using the reward model or rule-based reward function.
4. Compute z-score normalized advantages within each group.
5. Compute the clipped policy gradient loss across all prompt-response pairs.
6. Add the KL divergence penalty against the reference policy.
7. Update policy parameters. Optionally repeat steps 5-7 for multiple epochs on the same batch.
8. Periodically update the reference policy (or keep it fixed throughout training).
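The data-collection half of this loop can be sketched end-to-end with stubbed-out sampling and scoring (the `sample_fn`/`reward_fn` hooks, group size, and toy task are illustrative; the gradient update itself is elided):

```python
import numpy as np

def grpo_iteration(prompts, sample_fn, reward_fn, G: int = 8):
    """One GRPO data-collection pass: sample a group per prompt,
    score it, and normalize advantages within each group."""
    batch = []
    for q in prompts:
        outputs = [sample_fn(q) for _ in range(G)]                  # generate G responses
        rewards = np.array([reward_fn(q, o) for o in outputs])      # score each response
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group z-scores
        batch.append((q, outputs, adv))
    # The clipped loss + KL penalty would be computed on `batch`, then the
    # policy updated -- optionally for several epochs before resampling.
    return batch

# Toy stand-ins for demonstration:
rng = np.random.default_rng(0)
batch = grpo_iteration(
    prompts=["2+2=?"],
    sample_fn=lambda q: str(rng.integers(3, 6)),       # fake "policy" samples
    reward_fn=lambda q, o: 1.0 if o == "4" else 0.0,   # binary correctness reward
)
```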
Why It Matters
- Memory efficiency: GRPO requires roughly half the memory of PPO-based RLHF because it eliminates the critic model entirely. For a 70B parameter policy, this saves ~140GB of GPU memory (critic weights plus optimizer states).
- Training stability: Critic networks in PPO are a major source of instability -- poorly calibrated critics, reward scale sensitivity, and interacting training dynamics. GRPO sidesteps all of these.
- Emergent reasoning: DeepSeek-R1-Zero, trained with GRPO using only rule-based rewards, spontaneously developed chain-of-thought reasoning, self-verification ("let me check my work"), and "aha moments" without any supervised demonstrations.
- Simplicity: GRPO requires fewer hyperparameters than PPO with GAE (no GAE lambda, no value function learning rate, no value loss coefficient).
- Scalability: Larger group sizes give better advantage estimates, and the sampling of $G$ outputs per prompt is embarrassingly parallel across GPUs.
Key Technical Details
- Group size: $G = 64$ is standard. Values from 16 to 256 have been explored. Larger groups provide lower-variance estimates but cost more compute.
- DeepSeekMath results: 7B model achieved 58.8% on MATH and 88.2% on GSM8K.
- DeepSeek-R1-Zero: Pure RL with GRPO, no SFT stage -- reasoning emerged from RL alone with rule-based rewards.
- KL penalty coefficient $\beta$: Typically 0.01-0.04. Too low permits reward hacking; too high prevents learning.
- Multiple PPO epochs: GRPO can reuse the same sampled group for 2-4 gradient updates before resampling.
- Reward compatibility: Works with both learned reward models and hand-crafted reward functions.
- Sampling temperature: Higher temperatures during group sampling increase output diversity, providing better advantage estimates.
Common Misconceptions
- "GRPO is just REINFORCE with a baseline." GRPO adds PPO's clipped objective, explicit KL regularization, and z-score normalization. The combination is far more stable than vanilla REINFORCE.
- "Eliminating the critic must sacrifice learning quality." In practice, GRPO matches or exceeds PPO performance. The critic in standard PPO introduces its own errors and instabilities.
- "Sequence-level advantages lose too much information." For tasks with holistic rewards (correctness, helpfulness), sequence-level advantages are a natural fit.
- "GRPO only works with rule-based rewards." It works equally well with learned reward models. The reward source is independent of the advantage estimation method.
- "Larger group sizes are always better." Returns diminish beyond ; the marginal variance reduction rarely justifies the additional sampling cost.
Connections to Other Concepts
- ppo-for-language-models.md: GRPO inherits PPO's clipped objective but replaces the critic with group-based advantage estimation.
- reinforce.md: The intellectual ancestor -- using sampled returns rather than a learned critic. GRPO improves on it with clipping and KL regularization.
- rejection-sampling.md: GRPO's group sampling is related, but uses all samples for policy gradients rather than just the best one.
- rlhf.md: GRPO is a drop-in replacement for PPO in the RLHF or RLVR pipeline.
- chain-of-thought-training.md: DeepSeek-R1-Zero's emergent reasoning connects GRPO to the study of how reasoning is elicited through RL.
Further Reading
- "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Shao et al., 2024, arXiv:2402.03300) -- Introduces GRPO and demonstrates its effectiveness for mathematical reasoning.
- "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (DeepSeek-AI, 2025, arXiv:2501.12948) -- Shows GRPO with rule-based rewards producing emergent chain-of-thought reasoning.
- "Proximal Policy Optimization Algorithms" (Schulman et al., 2017) -- Essential background for understanding GRPO's clipped objective and trust-region approach.