Latest Research · Module 35 · 7 min read

DeepSeek-Math

Before R1 there was DeepSeek-Math: a series of papers that established the techniques — math-heavy pretraining, group-relative RL — that the broader reasoning-model wave would later run with.

The five-bullet version

  • DeepSeek-Math (2024) was a 7B specialist model focused on mathematical reasoning, with state-of-the-art results at its size.
  • Two big ideas: large-scale math-heavy pretraining corpus, and GRPO (Group Relative Policy Optimization) for RL on math problems.
  • GRPO sidesteps the cost of training a value function — uses sampled-group statistics as the baseline instead.
  • The techniques generalized: DeepSeek-V3 / R1 reused them for general reasoning. So did Qwen-QwQ and many other reasoning models.
  • Math became the testbed because answers are verifiable — the reward signal is unambiguous.

§ 00 · MATH AS THE GROUND TRUTH
Why math problems are the right RL target

For RL to work, you need a reward signal. For general chat, reward usually comes from a learned reward model — trained on human preferences, with all the biases and gaps that implies. For math, you can skip the reward model entirely. The answer is either right or wrong. You can check.

That makes math an ideal testbed for RL on language models: the reward signal is unambiguous, and it can be checked automatically.

§ 01 · PRETRAINING ON MATH-HEAVY DATA
Step one

DeepSeek-Math (the 2024 paper from DeepSeek introducing a 7B math-specialist model) starts from a strong general base model, then continues pretraining on a math-heavy corpus — mathematical papers, problem sets, formalized math (Lean, Coq), and curated reasoning chains — built around the 120B-token DeepSeekMath Corpus.

The lessons learned here — careful data curation and math-heavy continued pretraining — mattered for later work.

§ 02 · GRPO — GROUP-RELATIVE RL
The algorithm

Standard RL for LLMs uses PPO with a learned value function (the “critic”). The critic estimates the expected reward at each step; PPO uses it to compute advantages. The critic has to be trained, takes memory, and is unstable.
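To make the critic's role concrete, a deliberately simplified sketch (real PPO uses token-level value estimates and GAE, not this one-step form; the function name is an illustration, not a library API):

```python
def ppo_advantages(rewards: list[float], value_estimates: list[float]) -> list[float]:
    """With a critic: each completion's advantage is its reward minus the
    critic's *predicted* reward for that prompt. The predictions come from
    a separate trained network that costs memory and can be unstable."""
    return [r - v for r, v in zip(rewards, value_estimates)]

# Two completions of one prompt; the critic predicts 0.6 expected reward.
print(ppo_advantages([1.0, 0.0], [0.6, 0.6]))  # → [0.4, -0.6]
```

Everything GRPO changes is where that `value_estimates` baseline comes from.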

GRPO (Group Relative Policy Optimization), introduced in DeepSeek-Math, replaces the critic with a sample-based baseline. For each query:

  1. Sample G completions (the “group”, typically G = 8 or 16).
  2. Compute the reward for each one.
  3. Subtract the mean of the group rewards from each — that’s the advantage estimate.
  4. Update the policy to make above-average completions more likely and below-average less.
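The advantage computation in steps 2–3 fits in a few lines. This is a sketch, not the paper's code — the binary rewards stand in for a verifier's output, and whether to use sample or population std is an implementation detail:

```python
from statistics import mean, stdev

def grpo_advantages(rewards: list[float], normalize: bool = True) -> list[float]:
    """Group-relative advantages: subtract the in-group mean reward.
    DeepSeek-Math additionally divides by the group's std to normalize
    scale (sample std shown here; implementations vary)."""
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize:
        spread = stdev(rewards)
        if spread > 0:
            advantages = [a / spread for a in advantages]
    return advantages

# A group of G = 4 completions scored by a binary verifier:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
# each entry is (reward - group mean) / group std; they sum to zero
```

Correct completions get positive advantages, incorrect ones negative — no value network anywhere.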

No critic to train. The baseline is just “how did the rest of the group do?” Memory use drops roughly by half (no value function to store), training is more stable, and it works.

§ 03 · FROM DEEPSEEK-MATH TO R1
Scaling the recipe

The DeepSeek-Math paper worked at 7B scale. The techniques generalized: DeepSeek-V3 and R1 reused them for general reasoning, and so did Qwen-QwQ and many other reasoning models.

R1 in particular made the field pay attention — partly because of the model’s quality, partly because the training recipe was released openly, including weights. Other labs replicated or extended the approach quickly.

§ 04 · WHAT THIS TAUGHT THE FIELD
Three durable lessons

CHECK
A team wants to fine-tune a 7B base model for high-school math problems. They have a small budget and a verifier (correct-answer checker). Which approach is best?

§ 05 · TAKING THIS FORWARD
Related lessons

SFT vs RL covers the broader post-training landscape; Chain-of-Thought Monitoring covers what the visible reasoning that RL-on-verifiable-rewards produces actually gives us; Overthinking covers the failure mode at the long end of the chain.

§ · GOING DEEPER
GRPO and the recipe behind reasoning models

DeepSeekMath (Shao et al. 2024) is the paper that introduced GRPO and demonstrated the recipe that DeepSeek-R1 would later scale up. The key ingredients: continued pretraining on a large math corpus (DeepSeekMath Corpus, 120B tokens), instruction tuning on math problem-answer pairs, and RL with a process-supervised reward model — score each step of a chain-of-thought, not just the final answer.

GRPO itself is the simplification that made it tractable. Instead of training a separate value function (the “critic”) as in PPO, sample G rollouts from the same prompt, normalize their rewards within the group, and use the group-relative advantage as the gradient signal. No critic, a much simpler implementation, and at the scales DeepSeek used it, comparable to or better than PPO. R1 (2025) used GRPO with rule-based verifiable rewards (passing tests, correct math) and famously produced strong reasoning without supervised CoT training data.
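Written out in simplified, sequence-level form (the paper works token by token and estimates the KL term slightly differently), the group-relative advantage and the GRPO objective look like:

```latex
A_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}

\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
      \min\Big(\rho_i A_i,\;
      \operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,A_i\Big)\right]
    - \beta\,\mathbb{D}_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right],
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
```

The clipped-ratio structure is inherited directly from PPO; only the advantage $A_i$ changes, from a critic's estimate to a within-group statistic.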

§ · FURTHER READING
References & deeper sources

  1. Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models · arXiv
  2. DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model · arXiv
  3. DeepSeek-AI (2024). DeepSeek-V3 Technical Report · arXiv
  4. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · arXiv
  5. Schulman et al. (2017). Proximal Policy Optimization Algorithms (PPO baseline) · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.