One-Line Summary: In-context learning is the surprising ability of large language models to acquire a new task from a handful of examples shown in the prompt, without any parameter updates.

Prerequisites: Familiarity with autoregressive next-token prediction, the attention mechanism, and the difference between training-time and inference-time computation.

What It Is

In-context learning (ICL) is the phenomenon where a frozen language model, given a few input–output demonstrations followed by a new input, produces the correct output — effectively "learning" the task from context alone. No gradients flow. No weights move. The model simply reads the examples and continues the pattern.

The classic demonstration looks like this:

sea otter -> loutre de mer
peppermint -> menthe poivrée
plush giraffe -> ?

A model that has never been explicitly told to translate into French will still produce girafe en peluche. The task isn't trained; it's inferred from the surrounding text.
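In code, a few-shot prompt is nothing more than string concatenation. A minimal sketch (the separator and helper name are illustrative, not a fixed standard):

```python
# Build a few-shot ICL prompt: demonstrations first, then the query
# with a trailing separator for the model to complete.
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
]

def build_prompt(demos, query, sep=" -> "):
    """Concatenate input-output pairs, then the query awaiting its output."""
    lines = [f"{x}{sep}{y}" for x, y in demos]
    lines.append(f"{query}{sep}")
    return "\n".join(lines)

prompt = build_prompt(demonstrations, "plush giraffe")
print(prompt)
```

The model sees only this flat string; everything it "learns" about the task must be recovered from the pattern in the text.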

Why It Matters

ICL is the reason a single base model can perform thousands of distinct tasks — translation, classification, format conversion, multi-step reasoning — without a separate fine-tune for each. It collapses the old "one task, one model" assumption that defined the pre-LLM era of NLP.

There are several competing theories for what ICL actually is:

  • Implicit Bayesian inference — the model infers the most likely task posterior given the demonstrations, then samples completions consistent with that posterior.
  • Implicit gradient descent — von Oswald et al. (2023) and others have shown that linear self-attention can express gradient-descent steps, suggesting the forward pass quietly performs optimization on the in-context examples.
  • Task-vector formation — Hendel et al. (2023) found that demonstrations create a compact internal "task vector" in the residual stream, and you can extract it, store it, and inject it into other prompts.

A particularly clean mechanistic story comes from Olsson et al. (2022) at Anthropic: induction heads — pairs of attention heads that detect "A B ... A" patterns and complete them with "B". Induction heads emerge sharply during training, and their formation correlates with the appearance of ICL. They are, in a real sense, where the magic lives.
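The induction-head behavior reduces to a simple rule that can be written in a few lines of pure Python. This is a toy caricature of the mechanism, not how a transformer computes it:

```python
def induction_complete(tokens):
    """Toy induction-head rule: find the most recent earlier occurrence of
    the final token and predict the token that followed it ("A B ... A" -> "B")."""
    last = tokens[-1]
    # Scan backwards for the most recent earlier occurrence of the last token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None

print(induction_complete(["the", "cat", "sat", "on", "the"]))  # -> "cat"
```

A real induction head implements the same lookup with attention: one head attends from the current token back to its previous occurrence, and a paired head copies forward whatever came next.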

Key Technical Details

The number and ordering of the few-shot examples, and their similarity to the test input, all dramatically affect performance. Recency bias is real (the last demonstration carries the most weight), and the distribution of labels in the prompt can bias predictions independently of the true input–label mapping. ICL is not equivalent to fine-tuning: it is faster but more brittle, more sensitive to formatting, and bounded by the context window.
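Ordering sensitivity is easy to probe experimentally: every permutation of the same demonstrations yields a distinct prompt, and a real model can score them very differently. A sketch of the scaffolding (the sentiment examples and the `build_prompt` helper are illustrative; scoring each prompt with an actual model is left as a placeholder):

```python
from itertools import permutations

# Hypothetical demonstrations for a sentiment task.
demos = [
    ("great movie", "positive"),
    ("awful plot", "negative"),
    ("loved it", "positive"),
]
query = "what a waste of time"

def build_prompt(order, sep=" -> "):
    """Render one ordering of the demonstrations followed by the query."""
    lines = [f"{x}{sep}{y}" for x, y in order]
    lines.append(f"{query}{sep}")
    return "\n".join(lines)

# All 3! = 6 orderings of the same three demonstrations give distinct prompts.
# In a real experiment, each prompt would be scored by the model here.
prompts = [build_prompt(p) for p in permutations(demos)]
print(len(set(prompts)))  # -> 6
```

Running this sweep against a real model typically shows large accuracy swings across orderings, which is why prompt-ordering search and calibration methods exist.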