Latest Research · Module 32 · 7 min read

Model Sensitivity

Add a space, swap a word, change a single token — and a language model’s answer can flip. Sensitivity is one of the most underdiscussed brittleness traits of modern LLMs, and a real engineering hazard.

The five-bullet version

  • LLMs can produce wildly different outputs for inputs that look semantically identical.
  • Sources of sensitivity: prompt phrasing, example order in few-shot, system-prompt vs user-prompt placement, even invisible whitespace.
  • The cause is the autoregressive distribution: small input changes can move probability mass off the answer the model would otherwise emit.
  • Measurable as “answer variance under paraphrase” — a useful eval metric beyond raw accuracy.
  • Mitigations: average over paraphrases at inference, structured prompts, RL on robustness, calibration.

§ 00 · SMALL INPUTS, BIG SWINGS · Demonstrating the problem

Show a model two prompts that look like the same question, for instance:

  • “What is the capital of Australia?”
  • “hey quick one, whats the capital of australia”

Same question, different surface form. A robust model gives the same answer to both. Real models sometimes don’t — the first might return “Canberra” and the second “Sydney” (the common-but-wrong answer most non-Australians produce when asked colloquially).

This is model sensitivity: small, semantically irrelevant changes to a prompt produce large changes in the model’s output. It is distinct from intentional adversarial attacks — sensitivity is the model’s brittleness on innocuous variations. The system is more affected by surface variation than the semantics warrants, and in a production setting that is a real reliability concern.

§ 01 · WHERE SENSITIVITY SHOWS UP · Common axes

As the summary above lists, the usual axes are prompt phrasing, the order of few-shot examples, whether an instruction sits in the system prompt or the user message, and whitespace or formatting tokens that a human reader never notices.

§ 02 · WHY MODELS ARE SENSITIVE · The autoregressive lens

An LLM’s output is determined by the next-token distribution at each step. Two inputs that look semantically identical can have slightly different next-token distributions. If the difference is in a region of high uncertainty — where the top two candidate tokens are close in probability — a tiny perturbation can flip which one wins.
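
A minimal sketch of this lens, using the Hugging Face transformers library with GPT-2 as a stand-in model (the prompts and the tiny model are illustrative choices, not a recommendation): inspect the next-token distribution for two surface forms of the same question and see how close the top candidates sit.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a small stand-in; any causal LM exposes the same interface.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_top_k(prompt, k=5):
    """Return the k most likely next tokens and their probabilities."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]       # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(idx), round(p.item(), 3))
            for idx, p in zip(top.indices, top.values)]

# Two surface forms of the same question (illustrative):
for prompt in ["The capital of Australia is",
               "Q: whats the capital of australia?\nA: The capital is"]:
    print(repr(prompt), next_token_top_k(prompt))
# When the top two candidates sit close in probability, a tiny change to the
# input can flip which one wins under greedy decoding.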

Three contributing factors:

§ 03 · MEASURING IT · Sensitivity as an eval metric

Accuracy alone hides sensitivity. A model that gets 90% accuracy on a single phrasing might get 70% on the same task with different phrasings — and the 90% number gives you false confidence.

A useful eval pattern:

  1. For each test case, generate k paraphrases (via another LLM or a template-based system).
  2. Run the model on all k versions.
  3. Report two numbers: accuracy (any version correct) and consistency (all versions produce the same answer).

High accuracy with low consistency is a warning sign — the model sometimes gets it right by luck of phrasing, sometimes not.
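
A minimal sketch of steps 2 and 3, assuming the paraphrases are already generated and model_answer(prompt) is a hypothetical wrapper around whatever model is under test:

def paraphrase_eval(test_cases, model_answer):
    """test_cases: list of {"paraphrases": [k prompt strings], "gold": expected answer}."""
    any_correct = all_agree = 0
    for case in test_cases:
        answers = [model_answer(p) for p in case["paraphrases"]]
        any_correct += any(a == case["gold"] for a in answers)   # accuracy: any version correct
        all_agree += len(set(answers)) == 1                      # consistency: all versions agree
    n = len(test_cases)
    return {"accuracy": any_correct / n, "consistency": all_agree / n}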

Fig 1 · Sensitivity persists across the model-size frontier. Score (%) by model scale, accuracy on a single phrasing vs consistency across paraphrases: roughly 60 / 30 at 1B, 75 / 45 at 7B, 85 / 65 at 70B, and 92 / 78 at the frontier. Larger models do better, but the problem is not solved.

§ 04 · MITIGATIONS AND THEIR LIMITS · What helps, what doesn’t

Practical mitigations:

  • Averaging or majority-voting over paraphrases at inference time (sketched below).
  • Structured prompts that pin down surface form.
  • RL or fine-tuning objectives that reward robustness to paraphrase.
  • Calibration, so that uncertainty is visible rather than hidden in an arbitrary flip.
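
A minimal sketch of the first of these, paraphrase ensembling at inference time; paraphrase(prompt, n) and model_answer(prompt) are hypothetical helpers, e.g. a second LLM call for the rewrites and the production model for the answers:

from collections import Counter

def robust_answer(prompt, paraphrase, model_answer, k=5):
    """Majority vote over the original prompt plus k-1 paraphrases of it."""
    variants = [prompt] + paraphrase(prompt, k - 1)   # paraphrase() returns a list of rewrites
    answers = [model_answer(v) for v in variants]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)               # vote share doubles as a rough confidence signal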

What doesn’t reliably help:

CHECK · Your LLM-based classifier scores 92% on a benchmark, but you notice it sometimes flip-flops on essentially identical inputs. How do you quantify the issue?

§ 05 · TAKING THIS FORWARD · Adjacent reading

The Coherence lesson covers a related but distinct property — consistency across an extended generation, rather than across paraphrased inputs. Both are facets of the broader robustness problem for production LLMs.

§ · GOING DEEPER · Prompt sensitivity is a real measurement problem

Sclar et al. (2023) and Mizrahi et al. (2023) documented a finding that should change how evals are read: the same capability measured with different but semantically equivalent prompts can shift by 10+ percentage points. Models that look like clear winners on one prompt template can be ties or losers on another. The field has been reporting single-prompt numbers and treating them as if they were robust capability measurements.

Two responses are emerging. Multi-prompt evaluation (run each test across 10+ paraphrases, report the distribution) gives a defensible estimate of mean performance with uncertainty bands. Order effects compound the problem: Lu et al. (2021) showed that the ordering of few-shot examples can swing results dramatically, and Pezeshkpour & Hruschka (2023) showed that models prefer certain positions among multiple-choice options even when instructed not to. The current best practice for an honest eval is multi-prompt + permuted option order + reported variance, not a single point estimate.
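
A sketch of that recipe under two assumptions: model_choice(prompt, options) is a hypothetical call that returns the option the model picks, and templates is a list of paraphrased prompt formats with a {question} slot. The point is the reporting, a mean with spread across templates rather than a single number.

import random
import statistics

def eval_mcq(items, templates, model_choice, n_orders=4, seed=0):
    """items: list of {"question", "options", "gold"}; needs at least two templates."""
    rng = random.Random(seed)
    scores = []
    for template in templates:
        correct = total = 0
        for item in items:
            for _ in range(n_orders):                 # sample several option orders per item
                options = item["options"][:]
                rng.shuffle(options)
                prompt = template.format(question=item["question"])
                correct += model_choice(prompt, options) == item["gold"]
                total += 1
        scores.append(correct / total)
    return {"mean": statistics.mean(scores), "stdev": statistics.stdev(scores)}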

§ · FURTHER READING · References & deeper sources

  1. Sclar et al. (2023). Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design · arXiv
  2. Mizrahi et al. (2023). State of What Art? A Call for Multi-Prompt LLM Evaluation · TACL
  3. Lu et al. (2021). Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity · ACL
  4. Pezeshkpour & Hruschka (2023). Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions · NAACL
  5. Polo et al. (2024). Efficient Multi-Prompt Evaluation of LLMs · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.