Courses·Foundational Architecture·7 min read

Next-Token Prediction

Next-token prediction is the deceptively simple training objective at the heart of all decoder-based LLMs -- predicting the most likely next token given all preceding tokens -- and this single objective, applied at sufficient scale, gives rise to emergent capabilities including grammar, factual knowledge, reasoning, and more.

One-Line Summary: Next-token prediction is the deceptively simple training objective at the heart of all decoder-based LLMs -- predicting the most likely next token given all preceding tokens -- and this single objective, applied at sufficient scale, gives rise to emergent capabilities including grammar, factual knowledge, reasoning, and more.

Prerequisites: Understanding of probability distributions, the concept of a language model, the softmax function, and basic information theory (entropy, cross-entropy are helpful but not required).

What Is Next-Token Prediction?

At its core, an LLM is trained to do one thing: given a sequence of tokens, predict what comes next. That is the entire training objective. There is no explicit instruction to learn grammar, no labeled dataset of facts, no reasoning curriculum. Just: read this text, predict the next word. Repeat, billions of times.

Next token prediction showing the model receiving a sequence and outputting probability distributions over the vocabulary at each position Source: Jay Alammar – The Illustrated GPT-2

The profundity of this approach is in its simplicity. Imagine training a child by showing them the first $n - 1$ words of millions of sentences and asking "What word comes next?" Over time, to get consistently good at this game, the child would need to learn:

Grammar: Knowing that "She are" is unlikely but "She is" is common.
Facts: Predicting "Paris" after "The capital of France is."
Reasoning: Predicting "4" after "2 + 2 =" (not because it memorized the equation, but because it has seen enough mathematical patterns).
Style and tone: Predicting formal language after a formal prompt, informal language after casual text.
World knowledge: Understanding that "The sun rises in the" is likely followed by "east."

All of these capabilities emerge implicitly from a single, simple objective.

How It Works

flowchart LR
    S1["Autoregressive language model training"]
    S2["how the model learns to predict each token"]
    S1 --> S2

The Training Objective

Given a training corpus with token sequences $t_{1}, t_{2}, \dots, t_{n}$ , the model maximizes the log-likelihood:

$L = \sum_{i = 1}^{n} lo g P (t_{i} ∣ t_{1}, t_{2}, \dots, t_{i - 1}; θ)$

or equivalently, minimizes the negative log-likelihood (cross-entropy loss):

$L_{C E} = - \frac{1}{n} \sum_{i = 1}^{n} lo g P (t_{i} ∣ t_{1}, \dots, t_{i - 1}; θ)$

where $θ$ represents the model's parameters.

Step-by-Step Training

Sample a text sequence from the training corpus (e.g., a 2048 or 4096-token chunk of text).
Feed the entire sequence through the model. Thanks to the causal mask, position $i$ can only attend to positions $1$ through $i$ .
At each position $i$ , the model outputs a probability distribution $P (t_{i} ∣ t_{1}, \dots, t_{i - 1})$ over the vocabulary.
Compute the loss: compare the predicted distribution at each position to the actual next token (using cross-entropy).
Backpropagate the loss and update weights.

Because the causal mask ensures no information leakage, we get $n$ training signals from a single sequence -- every position simultaneously provides a next-token prediction task. This is enormously efficient.

Connection to Information Theory

Next-token prediction is intimately connected to data compression. A model that perfectly predicts the next token can compress text with zero overhead (it assigns probability 1 to the correct token, costing $- lo g (1) = 0$ bits). A weaker model assigns lower probability to the correct token, wasting bits.

The cross-entropy loss is measured in nats (natural log) or bits (log base 2):

$H (P, Q) = - \sum_{x} P (x) lo g Q (x)$

Lower cross-entropy means the model is a better predictor, which means it is a better compressor. This is not just an analogy -- it is a mathematical equivalence via the source coding theorem. Some researchers argue that the remarkable capabilities of LLMs stem from the fact that good compression requires understanding. To compress all of human text well, you must model the underlying patterns in human knowledge, language, and reasoning.

The Compression Hypothesis

This idea, sometimes called the "compression hypothesis" or attributed to observations by researchers like Ilya Sutskever, posits:

To predict text well, you must model the process that generated the text.
Human text is generated by humans with knowledge, reasoning, and intentions.
Therefore, a sufficiently powerful next-token predictor must develop internal representations of knowledge, reasoning, and intent.

This is why "just predicting the next word" leads to models that can solve math problems, write code, analyze arguments, and exhibit general intelligence. The prediction task is a proxy for modeling the entire distribution of human thought as expressed in language.

Why It Matters

Emergent Intelligence from Simple Objectives

The most remarkable aspect of next-token prediction is the gap between the simplicity of the objective and the sophistication of the emergent behavior. The model is never told to reason, never taught logical rules, never given labeled examples of "good reasoning." Yet at sufficient scale, these capabilities appear.

This challenges traditional AI approaches that explicitly encode knowledge and rules. Next-token prediction demonstrates that general intelligence might emerge from a sufficiently powerful optimization process applied to a sufficiently rich objective.

Scaling Laws

Research by Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") showed that the cross-entropy loss follows predictable power law relationships:

$L (N, D) \approx (\frac{N _{c}}{N})^{α_{N}} + (\frac{D _{c}}{D})^{α_{D}} + L_{\infty}$

where $N$ is model size (parameters), $D$ is dataset size (tokens), and $α_{N}, α_{D}$ are scaling exponents. This means:

Larger models and more data smoothly reduce loss.
The relationship is predictable, allowing researchers to estimate the performance of models they have not yet trained.
There is no sign of diminishing returns at current scales (though the rate of improvement decreases).

The Gap Between Loss and Capabilities

A subtlety: small improvements in cross-entropy loss can correspond to dramatic improvements in downstream capabilities. A model with 0.01 lower loss might go from being unable to solve a type of reasoning problem to solving it reliably. This is because the marginal loss improvement often comes from learning to handle the "hard" cases -- the ones requiring deeper understanding.

Key Technical Details

Vocabulary size: The model predicts over a vocabulary of 32K-128K+ tokens (subwords, not whole words). The final layer is a linear projection from $d_{m o d e l}$ to vocabulary size.
Per-token loss: Each token contributes equally to the loss by default, though some training recipes weight tokens differently (e.g., upweighting code or math).
Teacher forcing: During training, the model always sees ground-truth previous tokens, not its own predictions. This enables parallel training but creates a train-test mismatch (exposure bias).
Cross-entropy loss values: Typical losses range from 3.0+ (small models, early training) to below 1.5 (large, well-trained models). These correspond to perplexities (exponentiated cross-entropy) of roughly 20+ down to under 5.
Bits per byte (BPB): A hardware-independent metric that normalizes cross-entropy by the number of bytes per token. Used for comparing models with different tokenizers.
The loss landscape: Next-token prediction loss is well-behaved (smooth, convex-like) at scale, which is part of why training is stable for very large models.

Common Misconceptions

"Next-token prediction is just memorization." If a model only memorized sequences, it would assign zero probability to any sequence not in the training data. LLMs generalize -- they produce coherent text on topics, in styles, and in combinations never seen in training. Generalization requires understanding, not just memorization.
"The model predicts exactly one next token." The model produces a probability distribution over all possible next tokens. Many tokens might have significant probability. The specific token chosen depends on the decoding strategy (greedy, sampling, etc.).
"Predicting the next token is a shallow, surface-level task." Predicting the next token in complex text (mathematical proofs, legal arguments, code) requires deep structural understanding. The prediction task is as complex as the text being modeled.
"Better next-token prediction always means better downstream performance." Generally true at the macro level, but the relationship is not perfectly linear. Capability emergence can be sudden -- a model can go from 0% to 80% accuracy on a task over a small range of loss reduction.
"LLMs only learn from text they are directly trained on." LLMs learn patterns from training data, not just the data itself. They can combine patterns in novel ways, which is why they can answer questions about topics not literally present in their training set.

Connections to Other Concepts

autoregressive-generation.md: The inference-time manifestation of the next-token prediction objective (see autoregressive-generation.md).
causal-attention.md: The masking mechanism that enables parallel training of the next-token objective (see causal-attention.md).
logits-and-softmax.md: The output layer that converts hidden states into next-token probability distributions (see logits-and-softmax.md).
encoder-decoder-architecture.md: Encoder-only models use masked LM instead of next-token prediction; a different training paradigm with different tradeoffs (see encoder-decoder-architecture.md).
feed-forward-networks.md: Where the factual knowledge needed for good next-token prediction is stored (see feed-forward-networks.md).