Knowledge Distillation

One-Line Summary: Knowledge distillation trains a smaller "student" model to mimic the behavior of a larger "teacher" model by learning from the teacher's soft probability distributions rather than just hard labels, transferring rich knowledge about inter-class relationships that the raw training data alone cannot convey.

Prerequisites: Softmax and temperature scaling, cross-entropy loss, model training basics (forward pass, backpropagation, loss functions), the concept of model capacity, fine-tuning.

What Is Knowledge Distillation?

Consider how an experienced mentor teaches a junior colleague. The mentor does not simply say "the answer is X." They explain why it is X, what other possibilities were considered and why they were less likely, and what subtle patterns to look for. This nuanced guidance transfers far more knowledge than a simple answer key.

Knowledge distillation works the same way. When a large teacher model predicts the next token, it does not just output the correct token -- it produces a full probability distribution over the entire vocabulary. A token predicted with 70% confidence, with 15% on a close synonym and 5% on a related word, contains rich information about the structure of language. Distillation trains the student to reproduce this full distribution, not just match the top-1 answer.

Introduced by Hinton, Vinyals, and Dean in 2015, distillation has become one of the primary methods for creating smaller, deployable models from large, capable ones.

How It Works

Figure: Illustration of soft label distributions and the dark knowledge in the teacher's probabilities.

See detailed distillation architecture diagrams at: Lilian Weng - The Transformer Family v2

The Teacher-Student Framework

Teacher model: A large, powerful model (already trained). During distillation, its weights are frozen -- it only provides predictions.
Student model: A smaller model to be trained. It has fewer layers, smaller hidden dimensions, or both.
Training data: The same data (or a subset) used to train the teacher, or new data where the teacher generates "soft labels."

Soft Labels vs. Hard Labels

Hard labels are the ground truth: the next token is "cat" (a one-hot vector with 1.0 on "cat" and 0.0 everywhere else).

Soft labels are the teacher's probability distribution: "cat" = 0.72, "kitten" = 0.12, "feline" = 0.05, "dog" = 0.03, ...

The soft labels encode what the teacher has learned about the relationships between tokens. The fact that "kitten" gets 12% tells the student that these words are closely related. The fact that "dog" gets 3% (not 0%) tells the student that animals are related to each other. Hard labels contain none of this information.
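
To make the difference concrete, the sketch below compares the gradient a student receives from a hard label with the one it receives from the soft label above. The five-token vocabulary, the 0.08 of probability lumped into "other", and the uniform starting student are illustrative assumptions. The only fact relied on is that the gradient of cross-entropy with respect to the logits equals the predicted probability minus the target.

```python
import math

# Toy vocabulary; "other" stands in for the rest of the vocabulary (assumption).
VOCAB = ["cat", "kitten", "feline", "dog", "other"]

hard_label = [1.0, 0.0, 0.0, 0.0, 0.0]
soft_label = [0.72, 0.12, 0.05, 0.03, 0.08]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_gradient(target, predicted):
    """Gradient of cross-entropy w.r.t. the logits: predicted - target."""
    return [p - t for p, t in zip(predicted, target)]


# An untrained student that spreads probability evenly (assumption).
student_probs = softmax([0.0] * len(VOCAB))

grad_hard = logit_gradient(hard_label, student_probs)
grad_soft = logit_gradient(soft_label, student_probs)

for token, g_hard, g_soft in zip(VOCAB, grad_hard, grad_soft):
    print(f"{token:>7}  hard: {g_hard:+.2f}  soft: {g_soft:+.2f}")
```

A positive gradient means gradient descent lowers that token's logit. With the hard label, every wrong token is pushed down by the same amount. With the soft label, "kitten" is pushed down less than "dog", which is how the teacher's similarity structure reaches the student.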

Temperature in Distillation

To make the soft labels even more informative, both teacher and student outputs are passed through a temperature-scaled softmax:

$p_{i} = \frac{\exp(z_{i} / T)}{\sum_{j} \exp(z_{j} / T)}$

A high temperature (T = 4 to 20) softens the distributions, revealing more of the teacher's "dark knowledge" -- the subtle probability mass on tokens that would be near-zero at T = 1. This dark knowledge is where much of the useful structural information resides.
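
A minimal sketch of the temperature-scaled softmax follows, applied to one set of teacher logits at T = 1 and T = 4. The logit values and the function name `softmax_with_temperature` are hypothetical, chosen only to show the softening effect.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for five candidate tokens.
teacher_logits = [6.0, 4.2, 3.3, 2.8, 0.0]

p_t1 = softmax_with_temperature(teacher_logits, temperature=1.0)
p_t4 = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in p_t1])
print("T=4:", [round(p, 3) for p in p_t4])
```

Raising the temperature moves probability mass from the top token to the tail while preserving the ranking, so the low-probability tokens contribute a visible training signal instead of vanishing.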

The Distillation Loss

The total training loss is typically a weighted combination:

$L = \alpha \cdot T^{2} \cdot \mathrm{KL}\left(p_{\text{teacher}}^{(T)} \,\|\, p_{\text{student}}^{(T)}\right) + (1 - \alpha) \cdot L_{\text{CE}}\left(y, p_{\text{student}}^{(1)}\right)$

Where:

The first term is the distillation loss: KL divergence between the teacher's and student's softened distributions at temperature T. Gradients of this term shrink roughly as 1/T^2 as the temperature rises, so the T^2 factor rescales them to stay comparable to the hard-label term regardless of temperature.
The second term is the standard cross-entropy loss against the hard labels at temperature 1.
Alpha controls the balance (typically 0.5-0.9, favoring the distillation loss).
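
The combined loss can be sketched for a single token position as below. This is an illustrative pure-Python version under stated assumptions, not a reference implementation: the function names and the defaults `alpha=0.7` and `temperature=4.0` are hypothetical, and a real training loop would use a tensor library and average over a batch.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, numerically stable."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, hard_label_index,
                      alpha=0.7, temperature=4.0):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(y, student_1)."""
    # Soft term: both distributions are softened with the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    soft_term = temperature ** 2 * kl_divergence(p_teacher, p_student_soft)

    # Hard term: ordinary cross-entropy against the true token at T = 1.
    p_student = softmax(student_logits, 1.0)
    hard_term = -math.log(p_student[hard_label_index])

    return alpha * soft_term + (1.0 - alpha) * hard_term


# Hypothetical logits over a five-token vocabulary; token 0 is the true label.
teacher = [6.0, 4.2, 3.3, 2.8, 0.0]
student = [2.0, 1.0, 0.5, 1.5, 0.0]
print(distillation_loss(teacher, student, hard_label_index=0))
```

When the student's logits equal the teacher's, the soft term is zero and only the hard-label term remains, which is a quick sanity check for any implementation.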

Why Soft Labels Carry More Information

Consider a vocabulary of 50,000 tokens. A hard label is a one-hot vector: it identifies a single token, so it communicates at most log2(50,000) ≈ 15.6 bits of information per training example. The teacher's soft distribution communicates much more: it provides a probability for every token, encoding the teacher's understanding of semantic similarity, grammatical plausibility, and contextual relevance. This richer signal makes learning more sample-efficient and produces a student that generalizes better.
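
The figure for the hard label can be checked directly. The short sketch below computes it and contrasts the single token index a hard label supplies with the number of free real-valued targets in a soft label; the variable names are illustrative.

```python
import math

vocab_size = 50_000

# Upper bound on the information in a hard label: picking one token
# out of vocab_size equally likely candidates.
hard_label_bits = math.log2(vocab_size)

# A hard label supplies one token index. A soft label supplies a target
# probability for every token: vocab_size - 1 free values, since they sum to 1.
hard_label_values = 1
soft_label_values = vocab_size - 1

print(f"hard label: at most {hard_label_bits:.2f} bits, {hard_label_values} value")
print(f"soft label: {soft_label_values} real-valued targets per example")
```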

Practical Examples in LLMs

GPT-4 to smaller models: While not publicly documented in detail, it is widely understood that many smaller commercial models benefit from distillation. A powerful model generates high-quality training data (including its reasoning traces), which is then used to train smaller models. This is sometimes called "data distillation" or "synthetic data generation," and it blurs the line between traditional distillation and training on synthetic data.

Minitron (NVIDIA): The Minitron approach combines structured pruning with distillation:

Start with a large pre-trained model (e.g., 15B parameters).
Apply structured pruning -- remove entire attention heads, FFN neurons, or layers based on importance scores.
The pruned model is damaged (higher perplexity). Use the original large model as a teacher to distill knowledge back into the pruned model.
The result is a compact model (e.g., 8B) that performs significantly better than training an 8B model from scratch, at a fraction of the compute cost.
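
The pruning step can be sketched as below for a single FFN layer. This is a simplified illustration, not the Minitron implementation: scoring neurons by mean absolute activation, the helper names, and the toy data are all assumptions made for the example.

```python
def neuron_importance(activations):
    """Mean absolute activation per neuron over a batch of samples.

    activations: list of samples, each a list with one value per neuron.
    """
    num_neurons = len(activations[0])
    return [
        sum(abs(sample[j]) for sample in activations) / len(activations)
        for j in range(num_neurons)
    ]


def select_neurons_to_keep(importance, keep):
    """Indices of the `keep` most important neurons, in original order."""
    ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:keep])


def prune_columns(weight_rows, kept):
    """Drop every weight column whose neuron was not kept."""
    return [[row[j] for j in kept] for row in weight_rows]


# Toy calibration activations for a 4-neuron layer (hypothetical values).
calibration = [[0.1, -2.0, 0.0, 1.5],
               [0.2, 1.0, 0.0, -0.5]]
weights = [[1, 2, 3, 4],
           [5, 6, 7, 8]]

scores = neuron_importance(calibration)
kept = select_neurons_to_keep(scores, keep=2)
pruned = prune_columns(weights, kept)
print(kept, pruned)
```

The pruned layer is then the starting point for the student, which is trained against the frozen original model with a distillation loss to recover the lost quality.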

DistilBERT and DistilGPT-2: Early prominent examples. DistilBERT is about 40% smaller than BERT-base while retaining roughly 97% of its language-understanding performance and running about 60% faster; DistilGPT-2 applies the same approach to GPT-2.

Distillation for Deployment

In production settings, distillation serves a specific role in the deployment pipeline:

See Minitron pruning + distillation pipeline diagram at: NVIDIA Minitron Paper (arXiv:2407.14679)

Train or obtain the largest, best model (teacher).
Evaluate what performance level is acceptable for the application.
Distill to the smallest student that meets the quality bar.
Deploy the student, which has lower latency, lower memory requirements, and lower cost per query.

This is often combined with quantization: distill to a smaller architecture, then quantize to INT4, achieving compound compression.
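
The selection step and the compound compression arithmetic can be sketched as follows. The candidate names, sizes, and quality scores are made-up placeholders, and the memory estimate counts weights only (no KV cache or activations).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    params_billion: float
    quality: float  # score on the application's own evaluation set


def smallest_passing(candidates, quality_bar):
    """Return the smallest candidate that meets the quality bar, or None."""
    passing = [c for c in candidates if c.quality >= quality_bar]
    if not passing:
        return None
    return min(passing, key=lambda c: c.params_billion)


def weight_memory_gb(params_billion, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


# Hypothetical distilled students with hypothetical evaluation scores.
students = [
    Candidate("student-1b", 1.0, 0.71),
    Candidate("student-3b", 3.0, 0.83),
    Candidate("student-8b", 8.0, 0.88),
]

chosen = smallest_passing(students, quality_bar=0.80)
print(chosen)

# Compound compression: a 70B FP16 teacher versus an 8B INT4 student.
teacher_gb = weight_memory_gb(70, 16)
student_gb = weight_memory_gb(8, 4)
print(teacher_gb, student_gb, teacher_gb / student_gb)
```

Under these assumptions the weights shrink from 140 GB to 4 GB, a 35x reduction: roughly 8.75x from the smaller architecture multiplied by 4x from the lower precision.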

Why It Matters

Distillation is the bridge between frontier model capability and practical deployment:

Cost reduction: A distilled model may be 10-50x cheaper to serve per query.
Latency improvement: Smaller models generate tokens faster.
Accessibility: Distilled models can run on consumer hardware.
Specialization: A general-purpose teacher can be distilled into a specialist student for a specific domain, in some cases matching or even outperforming the teacher on that narrow task.

The economic significance is enormous. If a distilled 8B model can handle 90% of the queries that a 70B model handles, a serving infrastructure can route the easy queries to the cheap model and only use the expensive model for hard queries, dramatically reducing costs.
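
A back-of-the-envelope version of this routing argument is sketched below. It assumes per-query cost scales with parameter count (so the 8B model costs 8/70 of the 70B model) and that the router is free and never misroutes; both are simplifications for illustration.

```python
def blended_cost(easy_fraction, small_cost, large_cost):
    """Average cost per query when easy queries go to the small model."""
    if not 0.0 <= easy_fraction <= 1.0:
        raise ValueError("easy_fraction must be between 0 and 1")
    return easy_fraction * small_cost + (1.0 - easy_fraction) * large_cost


large_cost = 1.0          # cost of one query on the 70B model (normalized)
small_cost = 8 / 70       # assumption: cost proportional to parameter count

routed = blended_cost(0.9, small_cost, large_cost)
savings_factor = large_cost / routed

print(f"average cost per query: {routed:.3f} of the 70B-only cost")
print(f"savings factor: {savings_factor:.2f}x")
```

Note that the 10% of hard queries caps the savings at 10x however cheap the small model becomes, so improving the fraction the student can handle matters as much as shrinking the student.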

Key Technical Details

Feature-based distillation: Beyond matching output distributions, some methods match intermediate representations (hidden states, attention patterns) between teacher and student. This provides additional training signal but requires architectural compatibility.
Online vs. offline distillation: Offline distillation pre-computes teacher outputs and stores them. Online distillation runs the teacher during training. Offline is more practical for large teachers but requires storage for soft labels.
Self-distillation: A model distills knowledge from a previously trained copy of itself with the same architecture, or from an ensemble of copies trained with different random seeds. Surprisingly effective even without a separate, larger teacher.
Multi-teacher distillation: The student learns from multiple teachers, potentially capturing diverse knowledge that no single teacher possesses.
Progressive distillation: Distill in stages (e.g., 70B to 30B to 13B to 7B) rather than in one large step, which can improve final quality.
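
The storage cost of offline distillation is easy to estimate. The sketch below assumes a 50,000-token vocabulary, one million training tokens, and 2-byte (FP16) probabilities. The second calculation, keeping only the top 64 probabilities per token along with their 4-byte token ids, is a common workaround introduced here as an assumption; it is not something the list above describes.

```python
def soft_label_bytes(num_tokens, entries_per_token, bytes_per_entry):
    """Bytes needed to store precomputed teacher outputs."""
    return num_tokens * entries_per_token * bytes_per_entry


num_tokens = 1_000_000
vocab_size = 50_000

# Full distribution: one FP16 probability per vocabulary entry.
full_bytes = soft_label_bytes(num_tokens, vocab_size, bytes_per_entry=2)

# Hypothetical truncation: top-64 entries, each an FP16 probability + int32 id.
top_k = 64
topk_bytes = soft_label_bytes(num_tokens, top_k, bytes_per_entry=2 + 4)

print(f"full distributions: {full_bytes / 1e9:.0f} GB")
print(f"top-{top_k} only: {topk_bytes / 1e6:.0f} MB")
```

Even at one million tokens the full distributions take about 100 GB, which is why the storage requirement is the main practical cost of the offline approach.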

Common Misconceptions

"Distillation just trains on the teacher's outputs." While training on teacher-generated text (synthetic data) is a form of distillation, classical distillation specifically uses the full probability distributions, which contain far more information per example than the generated tokens alone.
"The student can match the teacher." Generally, the student has a hard ceiling imposed by its reduced capacity. A 1B model cannot fully replicate a 70B model's capabilities. Distillation helps the student reach its potential, but that potential is bounded.
"Distillation is the same as fine-tuning." Fine-tuning adapts a pre-trained model to a task using labeled data. Distillation transfers knowledge from a larger model using soft labels. They are different processes with different objectives, though they can be combined.
"You need the exact same training data." The student can be distilled on different data than the teacher was trained on. In practice, using a representative dataset is sufficient.
"Temperature in distillation is the same as temperature in sampling." The mathematical formula is identical, but the purpose is different. In distillation, high temperature reveals dark knowledge for training. In sampling, temperature controls output diversity.

Connections to Other Concepts

quantization.md: Distillation reduces parameter count (architectural compression); quantization reduces bits per parameter (precision compression). They stack: distill, then quantize.
speculative-decoding.md: A distilled small model makes an excellent draft model for speculative decoding of its teacher, combining two optimizations.
sampling-strategies.md: The temperature parameter in distillation directly parallels temperature in sampling, though serving a different purpose.
model-serving.md: Distilled models are easier to serve -- they fit on fewer GPUs, have smaller KV caches, and generate tokens faster, simplifying the entire serving infrastructure.
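flash-attention.md: Distillation reduces layers and hidden size but not sequence length, so a distilled model still benefits from Flash Attention on long contexts, particularly during prefill; the absolute savings are smaller because there is less attention compute to begin with.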