Model Sensitivity
Add a space, swap a word, change a single token — and a language model’s answer can flip. Sensitivity is one of the most underdiscussed brittleness traits of modern LLMs, and a real engineering hazard.
The five-bullet version
- LLMs can produce wildly different outputs for inputs that look semantically identical.
- Sources of sensitivity: prompt phrasing, example order in few-shot, system-prompt vs user-prompt placement, even invisible whitespace.
- The cause is the autoregressive distribution: small input changes can move probability mass off the answer the model would otherwise emit.
- Measurable as “answer variance under paraphrase” — a useful eval metric beyond raw accuracy.
- Mitigations: average over paraphrases at inference, structured prompts, RL on robustness, calibration.
§ 00 · SMALL INPUTS, BIG SWINGS
Demonstrating the problem
Show a model two prompts that look like the same question:
- “What’s the capital of Australia?”
- “What is Australia’s capital city?”
Same question, different surface form. A robust model gives the same answer to both. Real models sometimes don’t — the first might return “Canberra” and the second “Sydney” (the common-but-wrong answer most non-Australians produce when asked colloquially).
This is model sensitivity: the phenomenon where small, semantically irrelevant changes to a prompt produce large changes in the model’s output. It is distinct from an intentional adversarial attack; sensitivity is the model’s brittleness on innocuous variations. The system is more affected by surface variation than the semantics warrants, and in a production setting that is a real reliability concern.
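Here is a minimal probe of that behavior. The `query_model` stub is a toy stand-in wired to reproduce the hypothetical failure above, so the snippet runs on its own; in practice its body would be a call to your actual LLM client:

```python
# A minimal consistency probe. `query_model` is a toy stand-in so the
# snippet is self-contained; replace its body with a real LLM client call.
def query_model(prompt: str) -> str:
    # Hard-coded answers that mimic the failure described above (hypothetical).
    return "Canberra" if prompt.startswith("What's") else "Sydney"

paraphrases = [
    "What's the capital of Australia?",
    "What is Australia's capital city?",
]

answers = [query_model(p) for p in paraphrases]
for prompt, answer in zip(paraphrases, answers):
    print(f"{prompt!r} -> {answer!r}")

# One distinct answer across paraphrases = robust; more than one = sensitive.
print("consistent:", len(set(answers)) == 1)
```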
§ 01 · WHERE SENSITIVITY SHOWS UP
Common axes
- Phrasing. Active vs passive voice, formal vs informal, presence or absence of “please”.
- Order. The order of options in a multiple-choice question. The order of examples in a few-shot prompt.
- Whitespace and punctuation. Trailing newlines, double spaces, the Oxford comma. Real models show measurable accuracy differences across these.
- System vs user message. Identical instruction placed in the system message vs prepended to the user message produces different answers.
- Token-level perturbation. Replacing a word with a synonym, even a near-perfect one.
§ 02 · WHY MODELS ARE SENSITIVE
The autoregressive lens
An LLM’s output is determined by the next-token distribution at each step. Two inputs that look semantically identical can have slightly different next-token distributions. If the difference is in a region of high uncertainty — where the top two candidate tokens are close in probability — a tiny perturbation can flip which one wins.
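A toy illustration of the flip, with invented logits over three candidate tokens; the mechanism, not the numbers, is the point:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()

# Invented next-token logits over ["Canberra", "Sydney", "<other>"].
# The top two candidates are nearly tied: a high-uncertainty region.
logits_a = np.array([2.00, 1.98, -1.00])            # phrasing A
logits_b = logits_a + np.array([-0.03, 0.03, 0.0])  # phrasing B: tiny shift

for name, logits in [("phrasing A", logits_a), ("phrasing B", logits_b)]:
    probs = softmax(logits)
    print(name, probs.round(3), "-> greedy pick:", int(np.argmax(probs)))
# A 0.03 logit nudge flips the greedy pick from token 0 to token 1.
```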
Three contributing factors:
- Training data. If the model saw one phrasing far more often than another in training, it’s more confident on that phrasing.
- Tokenization. Different surface forms tokenize differently: “capital” and “ capital” (with a leading space) are different tokens, and different tokens activate different attention patterns (see the tokenizer check after this list).
- Position effects. The same instruction at position 50 vs position 5000 of the context behaves differently — see the lost-in-the-middle phenomenon.
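You can verify the leading-space effect directly with a tokenizer. A small check using tiktoken’s cl100k_base encoding (one common choice; exact IDs and splits vary by tokenizer):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# "capital" with and without a leading space maps to different token IDs,
# so the model genuinely sees different inputs.
print(enc.encode("capital"))   # one token ID
print(enc.encode(" capital"))  # a different token ID

# Whole paraphrases can also split into different numbers of tokens:
print(enc.encode("What's the capital of Australia?"))
print(enc.encode("What is Australia's capital city?"))
```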
§ 03 · MEASURING IT
Sensitivity as an eval metric
Accuracy alone hides sensitivity. A model that gets 90% accuracy on a single phrasing might get 70% on the same task with different phrasings — and the 90% number gives you false confidence.
A useful eval pattern:
- For each test case, generate k paraphrases (via another LLM or a template-based system).
- Run the model on all k versions.
- Report two numbers: accuracy (any version correct) and consistency (all versions produce the same answer).
High accuracy with low consistency is a warning sign — the model sometimes gets it right by luck of phrasing, sometimes not.
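A minimal sketch of that harness; the function shape and the toy model are mine, but the two metrics follow the definitions above:

```python
def paraphrase_eval(model, cases):
    """model: callable prompt -> answer string.
    cases: list of (paraphrases, gold) pairs, where `paraphrases` holds the
    k phrasings of one test case and `gold` is the expected answer."""
    any_correct = all_agree = 0
    for paraphrases, gold in cases:
        answers = [model(p) for p in paraphrases]
        any_correct += any(a == gold for a in answers)  # "any version correct"
        all_agree += len(set(answers)) == 1             # "all versions agree"
    n = len(cases)
    return any_correct / n, all_agree / n

# Toy stand-in model for demonstration; replace with a real client call.
toy = lambda p: "Canberra" if "capital of Australia" in p else "Sydney"
cases = [(["What's the capital of Australia?",
           "What is Australia's capital city?"], "Canberra")]
print(paraphrase_eval(toy, cases))  # (1.0, 0.0): right on one phrasing only
```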
§ 04 · MITIGATIONS AND THEIR LIMITS
What helps, what doesn’t
Practical mitigations:
- Inference-time averaging. Run the model on several paraphrases of the input and majority-vote the answer (see the sketch after this list). Expensive, but measurably more reliable.
- Structured prompts. Wrap instructions in clear delimiters (XML tags, JSON schema). Reduces variance from formatting differences.
- Lower temperature. Greedy decoding is more consistent. (Not always more accurate.)
- RL on robustness. Some recent post-training recipes explicitly reward consistent answers across paraphrased inputs.
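A minimal majority-vote sketch for the first mitigation; the names are mine, and any model callable and paraphrase source can be plugged in:

```python
from collections import Counter

def vote_answer(model, paraphrases):
    """Inference-time averaging: query the model on every paraphrase of the
    input and return the majority answer. Cost is one model call per
    paraphrase, which is why this mitigation is expensive."""
    answers = [model(p) for p in paraphrases]
    return Counter(answers).most_common(1)[0][0]

# Usage: vote_answer(query_model, ["What's the capital of Australia?",
#                                  "What is Australia's capital city?"])
```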
What doesn’t reliably help:
- Adding more examples — sometimes helps, sometimes adds new sensitivity to example order.
- A bigger model. Bigger models are less sensitive on average but not immune; the gap shrinks but doesn’t close.
§ 05 · TAKING THIS FORWARD
Adjacent reading
The Coherence lesson covers a related but distinct property — consistency across an extended generation, rather than across paraphrased inputs. Both are facets of the broader robustness problem for production LLMs.
§ · GOING DEEPER
Prompt sensitivity is a real measurement problem
Sclar et al. (2023) and Mizrahi et al. (2023) documented a finding that should change how evals are read: the same capability measured with different but semantically equivalent prompts can shift by 10+ percentage points. Models that look like clear winners on one prompt template can be ties or losers on another. The field has been reporting single-prompt numbers and treating them as if they were robust capability measurements.
Two responses are emerging. Multi-prompt evaluation (run each test across 10+ paraphrases, report the distribution) gives a defensible estimate of mean performance with uncertainty bands. Lu et al. (2021) and Pezeshkpour & Hruschka (2023) further showed that the order of multiple-choice options matters — models prefer certain positions even when instructed not to. The current best practice for an honest eval is multi-prompt + permuted option order + reported variance, not a single point estimate.
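A sketch of the option-permutation half of that recipe; the prompt format and function name are assumptions, not something the papers prescribe:

```python
import itertools
import random

def permuted_mcq_prompts(question, options, n_perms=6, seed=0):
    """Render one multiple-choice question under several option orders.
    A position-insensitive model should choose the same option *text*
    (not the same letter) under every ordering."""
    perms = list(itertools.permutations(options))
    random.Random(seed).shuffle(perms)
    prompts = []
    for perm in perms[:n_perms]:
        lines = [question] + [f"{letter}. {text}"
                              for letter, text in zip("ABCD", perm)]
        prompts.append(("\n".join(lines), perm))  # keep order for unmapping
    return prompts

# Score each case over paraphrases x option orders, then report the mean
# and the spread; the spread is the variance these papers call for.
```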
§ · FURTHER READING
References & deeper sources
- Sclar et al. (2023). Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design · arXiv
- Mizrahi et al. (2023). State of What Art? A Call for Multi-Prompt LLM Evaluation · TACL
- Lu et al. (2021). Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity · ACL
- Pezeshkpour & Hruschka (2023). Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions · NAACL
- Maia Polo et al. (2024). Efficient Multi-Prompt Evaluation of LLMs · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.