SFT vs RL
Two distinct ways to fine-tune a language model — supervised fine-tuning (imitate demonstrations) and reinforcement learning (maximize a reward). Modern post-training combines them in a specific order. Knowing why that order matters is half the picture of how good models are made.
The five-bullet version
- SFT (supervised fine-tuning): train the model to imitate human-written or curated demonstrations. Cross-entropy loss.
- RL (reinforcement learning): train the model to maximize a reward signal — usually from a reward model or rule-based scorer.
- Modern post-training is SFT first (teach the format and style), RL second (optimize for the quality signal).
- Variants: RLHF (reward model from human preferences), RLAIF (preferences from another AI), DPO (skips the explicit reward model), GRPO (group-normalized advantages, no critic).
- The reasoning-model wave (o1, R1) showed RL with rule-based rewards on math/code can produce striking capability gains.
§ 00 · TWO WAYS TO TEACH A MODEL
After pretraining, then what?
Pretraining gives you a model that can predict the next token in a web crawl. It’s not yet useful — ask it a question and it might respond with a related question, or a Reddit comment, or a stack trace. Pretraining gives the model its behaviors; post-training picks which behaviors to keep.
Two post-training paradigms:
- Supervised fine-tuning (SFT) — show the model examples of what you want it to do.
- Reinforcement learning (RL) — give the model a reward signal and let it optimize.
§ 01 · SFT — SHOW, THEN MEMORIZE
Imitation learning, applied to language
Supervised fine-tuning is the simplest post-training method. Gather a curated dataset of (instruction, ideal response) pairs. Continue training the pretrained model on those pairs, using the same next-token cross-entropy loss used in pretraining. The model learns to produce outputs that look like the demonstrations: to follow instructions, match a tone, or produce a format.
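A minimal sketch of that loss in PyTorch, assuming a causal LM that returns Hugging Face-style `.logits`; the helper name `sft_loss` and the prompt-masking convention (ignore index -100) are illustrative choices, not something specified in this lesson.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Next-token cross-entropy on the response only; prompt tokens are masked.

    Assumes model(input_ids).logits has shape [batch, seq, vocab] (Hugging
    Face-style causal LM). prompt_ids: [B, P], response_ids: [B, R].
    """
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)   # [B, P+R]
    logits = model(input_ids).logits                            # [B, P+R, V]

    # Labels are the inputs shifted left by one; set prompt positions to -100
    # so only response tokens contribute to the loss.
    labels = input_ids.clone()
    labels[:, :prompt_ids.size(1)] = -100

    shift_logits = logits[:, :-1, :]   # position t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```

Masking the prompt is the standard trick that keeps SFT from also training the model to predict its own instructions.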
SFT is good at:
- Teaching format. JSON output, Markdown structure, specific reply shapes.
- Teaching style. Customer-service voice, in-character roleplay.
- Domain adaptation. Bake legal/medical vocabulary in.
SFT is limited because it can only teach the model to do what was shown. If your demonstrations are merely good, the model will plateau at “good.” If they have flaws (and human-written demos always do), the model imitates the flaws. SFT also can’t directly express preferences like “A is better than B” — only “A is right.”
§ 02 · RL — REWARD WHAT YOU LIKE, PUNISH WHAT YOU DON’T
Optimizing toward a quality signal
Reinforcement learning for language models works the same way it does in robotics: the model takes an action (produces an output), gets a reward, and updates to make rewarded actions more likely. It doesn’t require demonstrations, just a way to score outputs.
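A sketch of one update step in that loop, using plain REINFORCE rather than the PPO/GRPO machinery production systems use; `sample_fn` and `reward_fn` are hypothetical placeholders for "generate a completion plus its log-probability" and "score it".

```python
import torch

def rl_step(model, optimizer, prompt_ids, sample_fn, reward_fn, num_samples=4):
    """One policy-gradient step: sample completions, score them, and push up
    the log-probability of high-reward completions (plain REINFORCE; real
    systems add clipping, baselines, and KL penalties for stability)."""
    losses = []
    for _ in range(num_samples):
        # sample_fn returns generated token ids and their summed log-probability
        completion_ids, log_prob = sample_fn(model, prompt_ids)
        reward = reward_fn(prompt_ids, completion_ids)   # scalar score
        losses.append(-reward * log_prob)                # higher reward -> stronger push toward this output

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```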
The reward signal can come from several places:
- Reward model trained on human preferences (RLHF). Humans rank pairs of outputs (“A is better than B”). Train a small model to predict those rankings. Use it as the reward in RL (the loss for training such a reward model is sketched after this list).
- AI-judged preferences (RLAIF). Same as RLHF, but another LLM rates outputs instead of humans.
- Rule-based rewards. For tasks with verifiable answers (math, code), check the output against a known-correct answer or test suite. The reward is just “did you get it right?”
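For the first two sources, the reward model is typically trained with a pairwise (Bradley-Terry) loss on preference comparisons. A minimal sketch, assuming a hypothetical `rm(prompt, response)` that returns a scalar score:

```python
import torch.nn.functional as F

def preference_loss(rm, prompt, chosen, rejected):
    """Bradley-Terry pairwise loss: push the reward model to score the
    preferred response above the rejected one. Works the same whether the
    preference labels come from humans (RLHF) or an LLM judge (RLAIF)."""
    r_chosen = rm(prompt, chosen)       # scalar score for the preferred response
    r_rejected = rm(prompt, rejected)   # scalar score for the other response
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```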
RL can teach things SFT can’t:
- Preferences between two okay options. Both are acceptable, one is better. RL picks up this signal; SFT can’t.
- Behaviors that no human wrote down. The model can discover output patterns that score well, without demonstrations.
- Capability beyond the demos. Famously, RLHF improves quality past the best human-written demonstrations, because optimizing a reward lets the model generalize beyond what any single demonstrator wrote.
§ 03 · THE COMBINATION IS WHAT SHIPS
SFT first, then RL
Every frontier model since GPT-3.5 has followed roughly this recipe:
- Pretraining. Predict the next token on the web. Months, billions of dollars.
- SFT. Teach the model to follow instructions and match a desired format/voice on tens to hundreds of thousands of curated demonstrations.
- RL. Optimize toward a quality signal — either human preferences (RLHF), AI preferences (RLAIF), or rule-based rewards. The model gets meaningfully better at the task on this step.
Step 2 teaches the model what good output looks like; step 3 teaches it what good output is. Both matter. SFT alone plateaus; RL alone is unstable without an SFT initialization.
§ 04 · WHAT CHANGED WITH REASONING MODELS
The o1 / R1 turn
Through 2023, the playbook above produced steadily better models. Then OpenAI’s o1 (late 2024) and DeepSeek’s R1 (early 2025) showed that RL with rule-based rewards on reasoning tasks — math, code, logic — could produce striking capability gains. The model learns to think longer before answering, because longer thinking gets more reward when the answer is verifiable.
The reasoning-model recipe inverts the usual emphasis:
- Less SFT — DeepSeek-R1’s early version skipped SFT entirely (“R1-Zero”). The model learned reasoning purely from RL on math and code, then SFT was added for formatting.
- RL on verifiable rewards — math answers can be checked exactly; code can be run against tests. No reward model required (a minimal checker is sketched after this list).
- Long chain-of-thought emerges. Without being told to, the models learn to write thousands of tokens of reasoning before answering, because that improves the rate of correct answers.
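A rule-based reward for a verifiable task can be a few lines of string matching. A sketch for math, assuming the model marks its final answer with a \boxed{...} convention (one common format; any unambiguous marker works):

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """Rule-based reward: 1.0 if the final boxed answer matches the
    reference exactly, else 0.0. No reward model involved."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```

Code tasks use the same idea with a test harness in place of the string comparison.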
This has reshaped the field. Frontier reasoning models in 2026 — o-series, Claude with extended thinking, DeepSeek-R1, Qwen-QwQ — all use some variant of this approach. The traditional SFT-then-RLHF recipe is still standard for chat models; reasoning models have a different one.
§ 05 · TAKING THIS FORWARD
Related threads
The reasoning-model story continues in the DeepSeek-Math, Overthinking, and Chain-of-Thought Monitoring lessons. The post-training stack remains one of the most active areas of LLM research; expect variants to keep appearing.
§ · GOING DEEPER
From InstructGPT to GRPO — the post-training story
InstructGPT (Ouyang et al. 2022) established the modern post-training recipe: supervised fine-tuning on human-demonstrated responses, then reinforcement learning from human feedback (RLHF) using PPO against a reward model trained on preference comparisons. RLHF was load-bearing — the chat-style behavior of ChatGPT came from the RL stage, not pretraining.
Two simplifications followed. DPO (Rafailov et al. 2023) showed you can fit the policy directly to preferences without an explicit reward model — same theoretical objective, much simpler implementation. GRPO (Shao et al. 2024, in the DeepSeekMath paper) replaced PPO’s value function with group-normalized advantages from multiple rollouts — used by DeepSeek-R1 to do RL on reasoning at scale without a critic. The frontier reasoning models are essentially SFT + GRPO on verifiable-reward tasks (math, code) plus RLHF for general behavior.
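Two sketches of those simplifications, with illustrative tensor names: the DPO loss needs only log-probabilities from the policy and a frozen reference model, and the GRPO advantage is just each rollout’s reward normalized against its own group.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO (Rafailov et al. 2023): the policy's log-prob margin over a frozen
    reference model acts as an implicit reward; no separate reward model."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO (Shao et al. 2024): sample several completions per prompt and use
    the group-normalized reward as the advantage, so no value network (critic)
    is needed. group_rewards: tensor of shape [num_completions]."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)
```

In DeepSeek-R1, these per-group advantages feed a clipped policy-gradient objective similar in spirit to PPO’s, minus the critic.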
§ · FURTHER READING
References & deeper sources
- Ouyang et al. (2022). Training language models to follow instructions with human feedback (InstructGPT) · NeurIPS
- Christiano et al. (2017). Deep Reinforcement Learning from Human Preferences · NeurIPS
- Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO) · NeurIPS
- Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (introduces GRPO) · arXiv
- Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.