Latest Research · Module 35 · 7 min read

DeepSeek-Math

Before R1 there was DeepSeek-Math: a series of papers that established the techniques — math-heavy pretraining, group-relative RL — that the broader reasoning-model wave would later run with.

The five-bullet version

  • DeepSeek-Math (2024) was a 7B specialist model focused on mathematical reasoning, with state-of-the-art results at its size.
  • Two big ideas: large-scale math-heavy pretraining corpus, and GRPO (Group Relative Policy Optimization) for RL on math problems.
  • GRPO sidesteps the cost of training a value function — uses sampled-group statistics as the baseline instead.
  • The techniques generalized: DeepSeek-V3 / R1 reused them for general reasoning. So did Qwen-QwQ and many other reasoning models.
  • Math became the testbed because answers are verifiable — the reward signal is unambiguous.

§ 00 · MATH AS THE GROUND TRUTH
Why math problems are the right RL target

For RL to work, you need a reward signal. For general chat, reward usually comes from a learned reward model — trained on human preferences, with all the biases and gaps that implies. For math, you can skip the reward model entirely. The answer is either right or wrong. You can check.

That makes math an ideal testbed for RL on language models: the reward signal is unambiguous, and it can be checked automatically.

§ 01 · PRETRAINING ON MATH-HEAVY DATA
Step one

DeepSeek-Math (the 2024 paper from DeepSeek introducing a 7B math-specialist model) starts from a strong general base model, then continues pretraining on a math-heavy corpus — mathematical papers, problem sets, formalized math (Lean, Coq), and curated reasoning chains — built around the 120B-token DeepSeekMath Corpus.

The lessons learned here — careful data curation and math-heavy continued pretraining — mattered for later work.

§ 02 · GRPO — GROUP-RELATIVE RL
The algorithm

Standard RL for LLMs uses PPO with a learned value function (the “critic”). The critic estimates the expected reward at each step; PPO uses it to compute advantages. The critic has to be trained, takes memory, and is unstable.
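To make the critic's role concrete, a deliberately simplified sketch (real PPO uses token-level value estimates and GAE, not this one-step form; the function name is an illustration, not a library API):

```python
def ppo_advantages(rewards: list[float], value_estimates: list[float]) -> list[float]:
    """With a critic: each completion's advantage is its reward minus the
    critic's *predicted* reward for that prompt. The predictions come from
    a separate trained network that costs memory and can be unstable."""
    return [r - v for r, v in zip(rewards, value_estimates)]

# Two completions of one prompt; the critic predicts 0.6 expected reward.
print(ppo_advantages([1.0, 0.0], [0.6, 0.6]))  # → [0.4, -0.6]
```

Everything GRPO changes is where that `value_estimates` baseline comes from.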

GRPO (Group Relative Policy Optimization), introduced in DeepSeek-Math, replaces the critic with a sample-based baseline. For each query:

  1. Sample G completions (the “group”, typically G = 8 or 16).
  2. Compute the reward for each one.
  3. Subtract the mean of the group rewards from each — that’s the advantage estimate.
  4. Update the policy to make above-average completions more likely and below-average less.
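The advantage computation in steps 2–3 fits in a few lines. This is a sketch, not the paper's code — the binary rewards stand in for a verifier's output, and whether to use sample or population std is an implementation detail:

```python
from statistics import mean, stdev

def grpo_advantages(rewards: list[float], normalize: bool = True) -> list[float]:
    """Group-relative advantages: subtract the in-group mean reward.
    DeepSeek-Math additionally divides by the group's std to normalize
    scale (sample std shown here; implementations vary)."""
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize:
        spread = stdev(rewards)
        if spread > 0:
            advantages = [a / spread for a in advantages]
    return advantages

# A group of G = 4 completions scored by a binary verifier:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
# each entry is (reward - group mean) / group std; they sum to zero
```

Correct completions get positive advantages, incorrect ones negative — no value network anywhere.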

No critic to train. The baseline is just “how did the rest of the group do?” Memory use drops roughly by half (no value function to store), training is more stable, and it works.

§ 03 · FROM DEEPSEEK-MATH TO R1
Scaling the recipe

The DeepSeek-Math paper worked at 7B scale. The techniques generalized: DeepSeek-V3 and R1 reused them for general reasoning, and so did Qwen-QwQ and many other reasoning models.

R1 in particular made the field pay attention — partly because of the model’s quality, partly because the training recipe was released openly, including weights. Other labs replicated or extended the approach quickly.

§ 04 · WHAT THIS TAUGHT THE FIELD
Three durable lessons

CHECK
A team wants to fine-tune a 7B base model for high-school math problems. They have a small budget and a verifier (correct-answer checker). Which approach is best?

§ 05 · TAKING THIS FORWARD
Related lessons

SFT vs RL covers the broader post-training landscape; Chain-of-Thought Monitoring covers what the visible reasoning that RL-on-verifiable-rewards produces actually gives us; Overthinking covers the failure mode at the long end of the chain.

§ · GOING DEEPER
GRPO and the recipe behind reasoning models

DeepSeekMath (Shao et al. 2024) is the paper that introduced GRPO and demonstrated the recipe that DeepSeek-R1 would later scale up. The key ingredients: continued pretraining on a large math corpus (DeepSeekMath Corpus, 120B tokens), instruction tuning on math problem-answer pairs, and RL with a process-supervised reward model — score each step of a chain-of-thought, not just the final answer.

GRPO itself is the simplification that made it tractable. Instead of training a separate value function (the “critic”) as in PPO, sample G rollouts from the same prompt, normalize their rewards within the group, and use the group-relative advantage as the gradient signal. No critic, a much simpler implementation, and at the scales DeepSeek used it, comparable to or better than PPO. R1 (2025) used GRPO with rule-based verifiable rewards (passing tests, correct math) and famously produced strong reasoning without supervised CoT training data.
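Written out in simplified, sequence-level form (the paper works token by token and estimates the KL term slightly differently), the group-relative advantage and the GRPO objective look like:

```latex
A_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}

\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
      \min\Big(\rho_i A_i,\;
      \operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,A_i\Big)\right]
    - \beta\,\mathbb{D}_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right],
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
```

The clipped-ratio structure is inherited directly from PPO; only the advantage $A_i$ changes, from a critic's estimate to a within-group statistic.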

§ · FURTHER READING
References & deeper sources

  1. Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models · arXiv
  2. DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model · arXiv
  3. DeepSeek-AI (2024). DeepSeek-V3 Technical Report · arXiv
  4. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · arXiv
  5. Schulman et al. (2017). Proximal Policy Optimization Algorithms (PPO baseline) · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.