One-Line Summary: Curriculum learning presents training examples in a meaningful order — usually easy to hard — instead of randomly, and modern LLM data mixing is its most consequential form.

Prerequisites: Understanding of stochastic gradient descent, importance sampling, and how training data is mixed across domains in pre-training.

What It Is

The original idea is simple and human: don't show the model the hardest examples first. Bengio et al. (2009) formalized it as curriculum learning, drawing on intuitions from how humans are taught — start with cleaner, simpler examples, build a foundation, then add complexity.

"Difficulty" can be measured in many ways: per-example loss under a reference model, sequence length, perplexity, or domain quality scores. The curriculum can be hand-designed (like a textbook) or learned (with a small proxy model that scores examples on the fly).
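A minimal sketch of loss-based difficulty scoring, assuming a hypothetical `ref_model` callable that returns the mean per-token negative log-likelihood of a text under some reference language model (the specific metrics and field names here are illustrative, not from any particular paper):

```python
import math

def difficulty_scores(examples, ref_model):
    """Rank examples easy-to-hard by reference-model loss.

    `ref_model` is a hypothetical callable: text -> mean per-token
    negative log-likelihood. Any frozen reference LM would do.
    """
    scored = []
    for text in examples:
        nll = ref_model(text)
        score = {
            "loss": nll,                   # higher = harder
            "perplexity": math.exp(nll),   # equivalent ranking, different scale
            "length": len(text.split()),   # a crude proxy, cheap to compute
        }
        scored.append((text, score))
    # Easy-to-hard curriculum: ascending reference loss.
    scored.sort(key=lambda pair: pair[1]["loss"])
    return scored
```

Sorting by perplexity would give the same order as sorting by loss, since one is a monotonic function of the other; length-based curricula are cheaper but correlate only loosely with difficulty.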

The modern incarnation, and the one that actually moves frontier models, is dynamic data mixing — adjusting the proportions of code, math, scientific text, conversational web data, and books across the course of pre-training. Microsoft's Phi series leaned into this hard with curated "textbook-quality" data; DoReMi (Xie et al., 2023) trained a small proxy model to learn optimal mixture weights for the bigger run.
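A toy sketch of a mixture schedule: the domain names, proportions, and the two-stage breakpoint below are invented for illustration, not taken from any published recipe (real systems like DoReMi learn the weights rather than hand-setting them):

```python
import random

# Hypothetical two-stage mixture schedule, keyed by training progress in [0, 1].
# Early: broad web text dominates; late: code and math are up-weighted.
SCHEDULE = {
    0.0: {"web": 0.60, "books": 0.20, "code": 0.10, "math": 0.10},
    0.7: {"web": 0.40, "books": 0.15, "code": 0.25, "math": 0.20},
}

def mixture_at(progress):
    """Return the active domain proportions for a given training progress."""
    key = max(k for k in SCHEDULE if k <= progress)
    return SCHEDULE[key]

def sample_domain(progress, rng=random):
    """Draw the domain for the next batch according to the current mixture."""
    weights = mixture_at(progress)
    domains, probs = zip(*weights.items())
    return rng.choices(domains, weights=probs, k=1)[0]
```

In practice the schedule would be smoother (interpolated rather than stepped), and the weights would come from a proxy-model optimization rather than a hand-written table; the sketch only shows where mixture weights plug into the sampling loop.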

Why It Matters

Two things make curriculum learning matter for production LLMs:

  1. Same compute, better outcome. The theoretical justification (Weinshall et al.) frames curriculum learning as a form of importance sampling that reduces gradient variance early in training. Less-noisy early gradients mean a faster path to a good basin in the loss landscape.

  2. It re-frames the data problem. Pre-training is no longer "scrape the web and shuffle" — it's a curriculum-design problem. What you train on, in what order, with what proportions, with what quality filtering, is now central to the recipe. The "data mixture" is as much a hyperparameter as the learning rate.

Anti-curriculum (hard examples first) also has its niches — particularly in contrastive learning, where exposing the model to hard negatives early can prevent representational collapse.

Key Technical Details

The curriculum can be static (decide the order up front) or dynamic (re-score examples as the model improves). Dynamic schemes are more powerful but more expensive — you need a way to re-evaluate difficulty without retraining the model. Self-paced learning, where the model itself decides which examples to study next based on its current loss, is the most aggressive version of this idea.
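A minimal sketch of self-paced selection, assuming a hypothetical `current_loss` callable that evaluates an example under the model's current weights (the function names and the linear annealing schedule are illustrative choices, not a fixed algorithm from the literature):

```python
def self_paced_batch(examples, current_loss, threshold, batch_size):
    """Self-paced selection: keep only examples the model currently finds
    'easy enough', i.e. whose loss falls below the admission threshold.

    `current_loss` is a hypothetical callable: example -> scalar loss
    under the model's current weights.
    """
    admissible = [x for x in examples if current_loss(x) <= threshold]
    return admissible[:batch_size]

def anneal_threshold(step, total_steps, lo=1.0, hi=8.0):
    """Linearly relax the loss threshold so harder examples enter later."""
    frac = min(step / total_steps, 1.0)
    return lo + frac * (hi - lo)
```

The re-scoring cost is the catch: every candidate example needs a forward pass under the current model, which is why practical dynamic schemes amortize it with a small proxy model or re-score only periodically.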