LLM Sampling
A model doesn’t answer your question — it produces a distribution over tokens, every time. Sampling is the recipe for turning that distribution into one specific word. Get the recipe wrong and the same model sounds repetitive, or unhinged.
The five-bullet version
- An LLM’s output is a probability distribution over its vocabulary, every step. Sampling picks one token.
- Greedy: always pick the most likely. Deterministic, often dull, sometimes loops.
- Temperature: divide logits by T before softmax. Lower = sharper, higher = flatter.
- Top-k: keep only the k highest-prob tokens, sample from those. Top-p / nucleus: keep enough to cover probability mass p.
- Use temp=0 for factual tasks, temp~0.7 + top-p~0.9 for prose, temp~1.2 for brainstorming.
§ 00 · THE NEXT-TOKEN PROBLEM · What the model actually outputs
When you ask an LLM a question, here’s what really happens. The model runs your prompt through the transformer once and emits a single thing: a vector of V numbers, where V is the vocabulary size — roughly 100,000 entries for a modern model. Each number is a logit: an unnormalized score for the corresponding token. Logits can be any real number, positive or negative.
To turn logits into probabilities, you apply softmax: softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ). The result is a clean probability distribution: every entry between 0 and 1, and the total sums to 1. That distribution is what the model believes about the next token — not a single guess but a complete opinion across the whole vocabulary.
Sampling is the conversion from distribution to token. It feels like a clerical step. It isn’t: the same model with two different sampling recipes can sound like a different writer.
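To make the pipeline concrete, here is a minimal sketch in Python (NumPy), using made-up toy logits over a five-word vocabulary; a real model emits one logit per entry in its ~100,000-token vocabulary:

```python
import numpy as np

# Toy logits over a five-word vocabulary (made-up numbers for illustration).
vocab = ["mat", "floor", "chair", "moon", "algorithm"]
logits = np.array([4.0, 2.5, 2.0, -1.0, -2.0])

def softmax(x):
    e = np.exp(x - x.max())      # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(logits)          # every entry in [0, 1], total sums to 1
token = np.random.choice(vocab, p=probs)   # sampling: one draw from that distribution
print(dict(zip(vocab, probs.round(3))), "->", token)
```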
§ 01 · GREEDY & LOG-PROBS · Pick the top one — and what that costs
The simplest possible sampler: take the argmax of the probability vector. Whichever token has the highest probability, emit that one. Repeat for the next position. This is greedy decoding: deterministic, fast, often repetitive — the “safe” baseline. Feed the same prompt twice and you get the same output. That sounds desirable, and for some tasks it is.
The downside: greedy decoding gets stuck. Once the model is in a loop — the same five tokens in a row, the same sentence repeated — there’s no way out, because the most likely next token is always the one that continues the loop. You see this most often when:
- The model isn’t sure what to say, so a bland, generic continuation sits narrowly on top at every step.
- A specific phrase has overwhelming probability at every step (“In conclusion, in conclusion, in conclusion…”).
- You’ve asked for something creative and gotten the median answer.
The fix: introduce some randomness. The next three knobs all do exactly that, in different ways.
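A greedy decoding loop is only a few lines. The sketch below assumes a hypothetical `model(ids)` callable that returns one logit per vocabulary entry; real inference APIs differ, so treat this as an illustration of the idea, not a specific library:

```python
import numpy as np

def greedy_decode(model, prompt_ids, max_new_tokens=50):
    """Greedy decoding: always emit the single most likely next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                 # one score per vocabulary entry
        ids.append(int(np.argmax(logits)))  # argmax = the greedy pick
    return ids
```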
§ 02 · TEMPERATURE · Sharpening or flattening the distribution
Temperature is the most direct knob: a scalar T that you divide every logit by before applying softmax, i.e. softmax(logits / T).
- T = 1 — the model’s native distribution, unmodified.
- T < 1 — the distribution sharpens. The top tokens get even more probability; the tail dies faster. At T = 0 the top token has probability 1 — this is greedy.
- T > 1 — the distribution flattens. Even unlikely tokens get a real shot. At T = 2 or higher, the model will say strange and surprising things.
[Interactive figure: “The cat sat on the” → distribution over next tokens. The highlighted bar is the greedy pick; faded rows are filtered out by top-k or top-p, and the survivors are renormalized so they sum to 100% again. At temperature 0.1 the bars collapse to one tall column and the top token wins virtually every time; at 2.5 the distribution looks almost uniform, and even “moon” and “algorithm” are in play.]
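In code, temperature is one division before the softmax. A sketch continuing the toy example, with T = 0 treated as greedy (the usual API convention):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_with_temperature(logits, T):
    if T == 0:                              # convention: T = 0 means greedy
        return int(np.argmax(logits))
    probs = softmax(logits / T)             # divide every logit by T, then softmax
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([4.0, 2.5, 2.0, -1.0, -2.0])   # toy logits from the earlier sketch
for T in (0.1, 1.0, 2.5):
    print(T, softmax(logits / T).round(3))       # sharp at 0.1, nearly flat at 2.5
```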
§ 03 · TOP-K & TOP-P (NUCLEUS) · Cutting off the long tail
Temperature changes the shape of the distribution but never cuts anything off — every token still has nonzero probability. For high temperatures, that’s a problem: “moon” in the example is a perfectly valid word but a terrible continuation of “the cat sat on the.” You want randomness, but not that much randomness.
Two filters fix this:
- Top-k. Sort tokens by probability. Keep only the top k, renormalize, and sample from those. Crude but fast; k = 40 is a common default.
- Top-p (nucleus). Sort by probability. Walk down the list, accumulating probability mass. Stop when you’ve covered p (e.g. p = 0.9 covers 90% of the mass). Keep only those tokens and sample from them.
Top-p is the more thoughtful filter because it adapts to the distribution shape. When the model is confident (most of the probability concentrated in 2-3 tokens), top-p keeps just those 2-3. When the model is uncertain (probability spread across 30 plausible continuations), top-p keeps all 30. Top-k always keeps exactly k, even when it shouldn’t.
These can stack. The standard recipe used by most chat APIs is temperature < 1 + top-p ≈ 0.9. Temperature controls how spiky the distribution is; top-p crops off the long tail of absurdities. Some implementations also support a repetition penalty, an extra term that down-weights tokens already seen recently in the output — useful for avoiding loops while keeping temperature low.
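A sketch of both filters and the stacked recipe, again in plain NumPy on the toy logits; library implementations differ in edge cases, and the repetition penalty shown is one common formulation (divide positive logits, multiply negative ones), not the only one:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, zero the rest, renormalize."""
    keep = np.argsort(probs)[-k:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability covers p."""
    order = np.argsort(probs)[::-1]                          # most likely first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def repetition_penalty(logits, recent_ids, penalty=1.2):
    """Down-weight tokens that already appeared recently in the output."""
    out = logits.astype(float)
    for i in set(recent_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

# The standard recipe: temperature < 1 for spikiness, then top-p to crop the tail.
logits = np.array([4.0, 2.5, 2.0, -1.0, -2.0])
probs = top_p_filter(softmax(logits / 0.7), p=0.9)
next_token = int(np.random.choice(len(probs), p=probs))
```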
§ 04 · PICKING A RECIPE · Defaults that actually work
Rough rules of thumb, by task:
- Factual extraction or classification. temperature = 0. Don’t introduce randomness when you’re asking for a fact. The model should always give the same answer to the same input.
- Code generation. temperature = 0 (or very low, like 0.2). Code has narrow correct answers; randomness mostly breaks it.
- Prose / chat / explanation. temperature = 0.7, top_p = 0.9. The default for most consumer chat. Enough variation to sound human, not enough to derail.
- Brainstorming / creative writing. temperature = 1.0–1.3, top_p = 0.95. You want surprise. The cost is occasional weird turns. (The rules of thumb are collected as a config sketch after this list.)
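Collected as a small config sketch. The parameter names `temperature` and `top_p` follow common API conventions, but check your provider’s documentation for the exact names and accepted ranges:

```python
# Hedged defaults mirroring the rules of thumb above.
SAMPLING_RECIPES = {
    "extraction":    {"temperature": 0.0},
    "code":          {"temperature": 0.0},                  # or very low, ~0.2
    "chat":          {"temperature": 0.7, "top_p": 0.9},
    "brainstorming": {"temperature": 1.2, "top_p": 0.95},
}

params = SAMPLING_RECIPES["chat"]   # pass these to your generation call
```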
Two warnings:
- Sampling doesn’t fix a confused model. If the model doesn’t know the answer, no amount of sampling tweaks will make it know. Sampling shapes how the model presents what it knows; it doesn’t change what it knows.
- Eval before you ship a temperature change. A small change to temperature can swing eval scores noticeably. Whatever the default is, write it down, and re-run your evals if you change it.
§ 05 · TAKING THIS FORWARD · Beyond the three knobs
Three more advanced sampling ideas you’ll meet in production:
- Constrained decoding. Force the model to emit only tokens that match a grammar (JSON schema, regex, function signature). Implemented by masking logits to ban anything outside the grammar; a sketch follows this list. Useful for structured outputs that must always parse.
- Speculative decoding. Use a small, fast model to propose several tokens; verify them with the big model in parallel. Doesn’t change the distribution — just makes inference faster.
- Beam search. Instead of one greedy thread, keep the top k partial outputs in parallel and return the best complete one. Common in translation, much less common in chat LLMs. Tends to produce safer, blander outputs than sampling.
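Constrained decoding deserves a tiny illustration. Here is a minimal sketch of the logit-masking idea, assuming you already have the set of token ids the grammar allows at this step; computing that set from a JSON schema or regex is the hard part, and libraries handle it in practice:

```python
import numpy as np

def constrained_sample(logits, allowed_ids):
    """Mask every token outside the grammar to -inf, then sample as usual."""
    allowed = list(allowed_ids)
    masked = np.full_like(logits, -np.inf, dtype=float)
    masked[allowed] = logits[allowed]
    e = np.exp(masked - masked.max())     # softmax over the allowed tokens only
    probs = e / e.sum()
    return int(np.random.choice(len(probs), p=probs))

# e.g. if the grammar only allows three (hypothetical) token ids next:
# next_id = constrained_sample(logits, allowed_ids={12, 34, 56})
```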
For most application work, you don’t need any of these — just the temperature/top-p defaults above, and a habit of treating sampling as a real hyperparameter rather than a number you copied from a blog post.
§ · GOING DEEPER · Beyond temperature and top-p
The three classic knobs — temperature, top-k, top-p (nucleus, Holtzman et al. 2019) — handle most cases. Two newer techniques are worth knowing. Speculative decoding (Leviathan et al. 2022) uses a small draft model to propose several tokens at once; the large model verifies them in parallel. Same output distribution, 2–3× speedup at inference. Every modern serving stack (vLLM, TensorRT-LLM) supports it.
For tasks where the model knows what it doesn’t know, contrastive decoding (Li et al. 2022) contrasts a strong model’s distribution against a weaker one’s, and DoLa (Chuang et al. 2024) contrasts late transformer layers against early ones — heuristics that reduce confident hallucinations. And for structured outputs (JSON, XML, code with a grammar), constrained decoding — restricting sampling to tokens consistent with the grammar — eliminates an entire class of output bugs.
§ · FURTHER READING · References & deeper sources
- Holtzman et al. (2019). The Curious Case of Neural Text Degeneration (top-p / nucleus sampling) · ICLR
- Fan et al. (2018). Hierarchical Neural Story Generation (top-k sampling) · ACL
- Leviathan et al. (2022). Fast Inference from Transformers via Speculative Decoding · ICML
- Li et al. (2022). Contrastive Decoding: Open-Ended Text Generation as Optimization · ACL
- Hewitt et al. (2022). Truncation Sampling as Language Model Desmoothing · EMNLP
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.