Knowledge Distillation for LLMs

One-Line Summary: Knowledge distillation evolved from compressing BERT-era models by mimicking output probabilities to a modern paradigm where large "teacher" models generate entire synthetic training datasets -- including reasoning traces -- that transfer intelligence through data rather than architecture mimicry.

Prerequisites: 03-bert.md, 07-gpt-4.md, 02-kaplan-scaling-laws.md, 03-deepseek-r1.md

What Is Knowledge Distillation for LLMs?

Imagine a master chef who has spent decades perfecting their craft. Traditional apprenticeship means watching the chef cook and trying to replicate their exact movements -- the angle of the knife, the timing of the flip. But a more effective approach might be for the master chef to write a comprehensive cookbook: not just recipes, but the reasoning behind every decision, the principles that generalize across dishes, the mistakes to avoid. The apprentice who studies this cookbook can become an excellent chef without ever matching the master's exact motions. That is the evolution of knowledge distillation in the LLM era -- from mimicking a teacher model's outputs to learning from the intelligence embedded in data the teacher creates.

The concept of knowledge distillation in neural networks dates back to Hinton, Vartia, and Dean's 2015 paper "Distilling the Knowledge in a Neural Network," which showed that a small "student" network could learn from the soft probability outputs of a large "teacher" network rather than from hard labels alone. The soft outputs carry richer information -- a teacher that assigns 70% probability to "cat" and 20% to "kitten" teaches the student about the relationship between those concepts, not just the correct answer.

But the LLM era demanded a fundamental rethinking of distillation. When your teacher model has hundreds of billions of parameters and your student has a few billion, simply matching output distributions isn't enough. The gap is too large, the task space too complex. What emerged instead was a radically different approach: using the teacher to generate high-quality training data that the student learns from directly. This "distillation through data" became the dominant paradigm by 2024-2025.

How It Works

The three eras of knowledge distillation -- from output mimicry to reasoning transfer:

Era 1: Classical (2019)         Era 2: Data-Based (2023)        Era 3: Reasoning (2025)
Match teacher outputs           Teacher generates data           Teacher generates thinking
 
┌────────┐  soft    ┌────────┐  ┌────────┐  52K    ┌────────┐  ┌────────┐ reasoning ┌────────┐
│ BERT   │──probs──▶│Distil- │  │ GPT-3.5│──inst.──▶│ Alpaca │  │ R1     │──traces──▶│R1-Dist.│
│ 110M   │  match   │ BERT   │  │(Teacher│  pairs  │ 7B     │  │ 671B   │  (chains  │ 32B    │
│(Teacher│          │ 66M    │  │  API)  │         │        │  │(Teacher│  of thought│        │
│)       │          │(Student│  └────────┘         │(Student│  │)       │  examples) │(Student│
└────────┘          │)       │                     │)       │  └────────┘           │)       │
                    └────────┘                     └────────┘                       └────────┘
97% of BERT         ~ChatGPT-like quality          Beats o1-mini on math!
at 60% size         for ~$600                      (72.6% vs 63.6% AIME)
 
Cost to create:     Cost to create:                Cost to create:
Research lab        $600 (API + compute)            Minimal (SFT on traces)
                    Anyone can do this!             Open weights available

Era 1: Classical Distillation -- DistilBERT (2019)

The earliest successful LLM distillation was DistilBERT from Hugging Face (Sanh et al., October 2019). The approach was straightforward: take BERT's 12-layer, 110M parameter architecture, remove every other layer to create a 6-layer, 66M parameter student, and train it to match BERT's output distributions using a combination of three loss functions:

Soft target loss: Match the teacher's full probability distribution (not just the top prediction)
Hard target loss: Standard cross-entropy against ground-truth labels
Cosine embedding loss: Align the student's hidden representations with the teacher's

The result retained 97% of BERT's performance at 60% of the size and 2x the inference speed. This was classical distillation: same architecture family, smaller version, trained to mimic the teacher's behavior. It worked well for BERT-scale models but couldn't bridge the gap between, say, GPT-4 and a 7B model.

Era 2: Distillation Through Prompting -- Alpaca and Vicuna (2023)

The pivotal shift came in March 2023 with Stanford's Alpaca. Instead of trying to match GPT-3.5's internal representations, the team simply prompted GPT-3.5 (text-davinci-003) to generate 52,000 instruction-following examples. They then fine-tuned LLaMA-7B on this synthetic dataset. The cost: under $500 in API fees. The result: a 7B model that, in blind evaluations, produced outputs qualitatively similar to GPT-3.5 on many tasks.

This was a conceptual revolution. The "distillation" happened entirely through data generation, not through any direct connection between teacher and student architectures. The teacher's knowledge was embedded in the examples it produced, and the student extracted that knowledge through standard supervised fine-tuning. Vicuna (March 2023) extended this by training on 70K user-shared ChatGPT conversations, achieving what evaluators rated as 90% of ChatGPT's quality.

The approach spawned an explosion of "distilled" models: WizardLM, Orca, OpenHermes, and dozens of others -- all built by prompting frontier models to generate training data and then fine-tuning smaller open-weight models on the results.

Era 3: Reasoning Trace Distillation -- DeepSeek R1 (2025)

The most sophisticated form of distillation emerged with DeepSeek-R1 in January 2025. Rather than distilling only final answers or general instruction-following ability, DeepSeek distilled reasoning itself -- the step-by-step chain of thought that a model uses to work through complex problems.

The process worked in stages. First, DeepSeek-R1 (a 671B MoE model trained with reinforcement learning) generated detailed reasoning traces for thousands of math, science, and coding problems. These traces included the model's internal deliberation: exploring approaches, checking intermediate results, backtracking from dead ends, and arriving at verified conclusions. Then, smaller models (ranging from 1.5B to 70B parameters, based on Qwen and LLaMA architectures) were fine-tuned on these reasoning traces using supervised learning.

The results were remarkable. DeepSeek-R1-Distill-Qwen-32B matched or exceeded OpenAI's o1-mini on math benchmarks. Even the 7B distilled variant showed genuine multi-step reasoning ability. The key insight was that reasoning traces contain far richer learning signal than input-output pairs alone. A student that sees the teacher's thinking process learns not just what to answer, but how to think.

The Modern Distillation Toolkit

By 2025, practical distillation combines multiple approaches:

Instruction distillation: Teacher generates diverse instruction-response pairs covering a broad task distribution
Chain-of-thought distillation: Teacher generates step-by-step reasoning for complex problems
Preference distillation: Teacher ranks multiple responses, creating preference data for RLHF/DPO training of the student
Curriculum distillation: Data is ordered from simple to complex, mimicking pedagogical scaffolding
Verification distillation: Teacher generates solutions and self-verifications, teaching the student to check its own work

Why It Matters

The Great Equalizer

Knowledge distillation through data generation is perhaps the single most democratizing technique in AI. It allows anyone with API access to a frontier model to create training data for a competitive smaller model. This is why the open-weight ecosystem exploded in 2023-2024: Meta released LLaMA weights, researchers used GPT-4/Claude to generate training data, and suddenly the community had capable models that rivaled commercial offerings. The cost of creating a competent instruction-following model dropped from hundreds of millions of dollars to a few thousand.

Reasoning at Scale

The DeepSeek-R1 distillation results suggest that reasoning ability -- long thought to require massive models and extensive RL training -- can be substantially compressed. If a 7B model can exhibit genuine multi-step reasoning after learning from a 671B model's reasoning traces, the implications for deployment are profound. Reasoning-capable models on phones, in browsers, and on edge devices become viable.

The Ethical and Legal Frontier

Distillation raises unresolved questions. When Alpaca was trained on GPT-3.5 outputs, OpenAI's terms of service prohibited using outputs to train competing models. Many distilled models exist in a legal gray area. The practice also raises questions about credit and attribution: if a small model's capabilities derive primarily from a larger model's generated data, who "owns" those capabilities?

Key Technical Details

DistilBERT (2019): 66M parameters, retained 97% of BERT-110M performance at 60% size
Alpaca (March 2023): 52K GPT-3.5-generated examples, fine-tuned LLaMA-7B, total cost ~$500
Vicuna (March 2023): trained on 70K ChatGPT conversations, rated at ~90% of ChatGPT quality
DeepSeek-R1 distilled variants: 1.5B, 7B, 8B, 14B, 32B, 70B -- all from 671B teacher's reasoning traces
R1-Distill-Qwen-32B matched o1-mini on AIME 2024 (72.6% vs 63.6%)
Modern distillation typically uses 50K-500K high-quality examples rather than millions of noisy ones
Chain-of-thought distillation produces 3-10x longer training sequences than answer-only distillation

Common Misconceptions

"Distillation just copies the teacher model." Modern distillation doesn't transfer weights or architecture -- it transfers knowledge through data. The student model develops its own internal representations, often quite different from the teacher's.
"Distilled models are always worse than their teachers." On specific tasks, distilled models can match or exceed teachers. DeepSeek-R1-Distill-Qwen-32B outperformed o1-mini on math despite being far smaller than the R1 teacher.
"You need access to the teacher's weights to distill." The modern paradigm only requires API access to generate training data. This is precisely why distillation became so democratizing.
"More distillation data is always better." Quality and diversity matter more than quantity. Phi's success with ~7B tokens of high-quality data versus trillions of web-crawled tokens demonstrates this principle clearly.
"Distillation only works for easy tasks." Reasoning trace distillation has shown that even complex multi-step reasoning can be distilled, challenging assumptions about what requires massive scale.

Connections to Other Concepts

Knowledge distillation is the enabling technique behind 01-phi-series.md (synthetic textbook data from GPT-3.5/4) and is central to the training of 02-gemma.md (Gemma 2's distillation from larger Gemini models). The reasoning distillation approach connects directly to 03-deepseek-r1.md and the broader reasoning paradigm in 05-the-reasoning-paradigm-shift.md. Distillation's role in the open ecosystem is key to understanding 07-open-vs-closed-the-narrowing-gap.md, and its combination with 05-lora-and-fine-tuning-democratization.md enables the rapid customization of small models.