One-Line Summary: Multi-token prediction trains the model to predict several future tokens at once through parallel prediction heads, producing richer representations and faster inference.

Prerequisites: Understanding of next-token prediction, the cross-entropy loss, and the cost structure of transformer training (backbone vs. heads).

What It Is

Standard language models predict the next token, full stop. Multi-token prediction (MTP) keeps the same shared transformer backbone, but instead of one prediction head, it adds several — head 1 predicts token t+1, head 2 predicts t+2, head 3 predicts t+3, and so on. All heads compute their loss in parallel during training. Meta's Gloeckle et al. (2024) demonstrated this at production scale; DeepSeek-V3 made MTP a load-bearing piece of its architecture.

backbone -> [head_1: t+1] [head_2: t+2] [head_3: t+3] [head_4: t+4]
            ^             ^             ^             ^
            standard      +planning     +planning     +planning
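The setup above can be sketched as a toy training objective. This is a minimal pure-Python illustration, not a real implementation: the dimensions, the random "hidden states," and the linear heads are invented stand-ins for a transformer backbone and learned projections.

```python
import math
import random

random.seed(0)
D, V, K, T = 8, 16, 4, 10        # hidden dim, vocab size, num heads, seq length

# One shared hidden state per position (stand-in for the backbone output).
hidden = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]

# K independent linear heads: head k maps the SAME hidden state to logits
# for token t+k.
heads = [[[random.gauss(0, 1) for _ in range(V)] for _ in range(D)]
         for _ in range(K)]
tokens = [random.randrange(V) for _ in range(T + K)]   # ground-truth token ids

def logits(h, W):
    return [sum(h[d] * W[d][v] for d in range(D)) for v in range(V)]

def cross_entropy(z, target):
    m = max(z)                                   # subtract max for stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in z))
    return log_norm - z[target]

# Every head computes its loss over the same positions; only the target
# offset differs (head k is scored against token t+k).
head_losses = []
for k in range(1, K + 1):
    per_pos = [cross_entropy(logits(hidden[t], heads[k - 1]), tokens[t + k])
               for t in range(T)]
    head_losses.append(sum(per_pos) / T)

mtp_loss = sum(head_losses)      # unweighted sum over the K heads
```

The key point the sketch makes concrete: all heads read the identical hidden state, so the backbone is trained to encode information about several future tokens at once.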

Why It Matters

Predicting four tokens ahead from the same hidden state forces the model to develop forward-looking representations rather than purely local ones. This is especially helpful for code generation, where the right next token often depends on a structural commitment several tokens out (closing a bracket, completing an argument list, returning a value).

There's also a concrete inference win. The auxiliary heads can be repurposed at decode time as a built-in speculative decoder: head 2 proposes a draft for t+2, head 3 proposes one for t+3, and the drafts are verified against the main head in a single forward pass. Reported speedups are typically 1.5–2×, with no separate draft model to train or serve. The extra memory for keeping the heads is small, and because every accepted token is checked against the main head, greedy outputs match what standard decoding would produce.
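Here is a toy sketch of that draft-then-verify loop. Everything in it is invented for illustration: `head_predict` stands in for a model whose head k predicts token t+k via a simple arithmetic rule, with head 4 made deliberately wrong so a rejection actually occurs, and the per-draft verification loop simulates what a real implementation does in one batched forward pass.

```python
def head_predict(x, k):
    # Toy stand-in: head k's greedy prediction of token t+k given last token x.
    # Head 4 is deliberately off by one so the example shows a rejection.
    return (x + k + (1 if k == 4 else 0)) % 100

def draft_tokens(x, num_heads=4):
    # One forward pass: every head predicts in parallel from the same state.
    return [head_predict(x, k) for k in range(1, num_heads + 1)]

def decode_step(last_token):
    """One draft-then-verify cycle; returns all tokens accepted this step."""
    preds = draft_tokens(last_token)
    accepted = [preds[0]]          # the main head's t+1 is always kept
    drafts = preds[1:]             # auxiliary heads draft t+2, t+3, t+4
    # Verification (simulating a single batched pass): compare each draft
    # with the main head's prediction at that position; accept drafts until
    # the first mismatch.
    for d in drafts:
        main = head_predict(accepted[-1], 1)
        if d == main:
            accepted.append(d)
        else:
            accepted.append(main)  # rejected draft: keep the verifier's token
            break
    return accepted
```

Starting from token 10, one cycle accepts four tokens (`[11, 12, 13, 14]`): three verified drafts plus the correction produced when head 4's draft is rejected, versus one token per step for standard decoding.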

Key Technical Details

The training overhead is modest — usually 10–15% — because the heads are tiny relative to the transformer backbone. Loss weighting matters: heads predicting further out are noisier, so they often get smaller weights. At inference, you can keep the auxiliary heads (for speculative decoding) or drop them entirely (saving memory) and run only the main head.
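The loss-weighting point can be made concrete with a few lines. The loss values and the geometric 0.5^(k-1) decay below are illustrative choices, not a published recipe; real schedules are tuned per model.

```python
# Hypothetical per-head losses: heads predicting further out are noisier,
# so their losses run higher.
head_losses = {1: 2.10, 2: 2.45, 3: 2.80, 4: 3.05}

# Decaying weights down-weight the noisier distant heads
# (0.5**(k-1) gives 1.0, 0.5, 0.25, 0.125).
weights = {k: 0.5 ** (k - 1) for k in head_losses}

total = sum(weights[k] * head_losses[k] for k in head_losses)
```

With these numbers the main head dominates the objective, so the distant heads shape the representations without drowning out ordinary next-token accuracy.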