Latest Research · Module 32 · 7 min read

Model Sensitivity

Add a space, swap a word, change a single token — and a language model’s answer can flip. Sensitivity is one of the most underdiscussed brittleness traits of modern LLMs, and a real engineering hazard.

The five-bullet version

  • LLMs can produce wildly different outputs for inputs that look semantically identical.
  • Sources of sensitivity: prompt phrasing, example order in few-shot, system-prompt vs user-prompt placement, even invisible whitespace.
  • The cause is the autoregressive distribution: small input changes can move probability mass off the answer the model would otherwise emit.
  • Measurable as “answer variance under paraphrase” — a useful eval metric beyond raw accuracy.
  • Mitigations: average over paraphrases at inference, structured prompts, RL on robustness, calibration.

§ 00 · SMALL INPUTS, BIG SWINGS · Demonstrating the problem

Show a model two prompts that look like the same question, for instance:

  • “What is the capital of Australia?”
  • “hey quick one, whats the capital of australia”

Same question, different surface form. A robust model gives the same answer to both. Real models sometimes don’t — the first might return “Canberra” and the second “Sydney” (the common-but-wrong answer most non-Australians produce when asked colloquially).

This is model sensitivity: small, semantically irrelevant changes to a prompt produce large changes in the model’s output. It is distinct from intentional adversarial attacks — sensitivity is the model’s brittleness on innocuous variations. The system is more affected by surface variation than the semantics warrants, and in a production setting that is a real reliability concern.

§ 01 · WHERE SENSITIVITY SHOWS UP · Common axes

As the summary above lists, the usual axes are prompt phrasing, the order of few-shot examples, whether an instruction sits in the system prompt or the user message, and whitespace or formatting tokens that a human reader never notices.

§ 02 · WHY MODELS ARE SENSITIVE · The autoregressive lens

An LLM’s output is determined by the next-token distribution at each step. Two inputs that look semantically identical can have slightly different next-token distributions. If the difference is in a region of high uncertainty — where the top two candidate tokens are close in probability — a tiny perturbation can flip which one wins.
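
A minimal sketch of this lens, using the Hugging Face transformers library with GPT-2 as a stand-in model (the prompts and the tiny model are illustrative choices, not a recommendation): inspect the next-token distribution for two surface forms of the same question and see how close the top candidates sit.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a small stand-in; any causal LM exposes the same interface.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_top_k(prompt, k=5):
    """Return the k most likely next tokens and their probabilities."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]       # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(idx), round(p.item(), 3))
            for idx, p in zip(top.indices, top.values)]

# Two surface forms of the same question (illustrative):
for prompt in ["The capital of Australia is",
               "Q: whats the capital of australia?\nA: The capital is"]:
    print(repr(prompt), next_token_top_k(prompt))
# When the top two candidates sit close in probability, a tiny change to the
# input can flip which one wins under greedy decoding.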

Three contributing factors:

§ 03 · MEASURING IT · Sensitivity as an eval metric

Accuracy alone hides sensitivity. A model that gets 90% accuracy on a single phrasing might get 70% on the same task with different phrasings — and the 90% number gives you false confidence.

A useful eval pattern:

  1. For each test case, generate k paraphrases (via another LLM or a template-based system).
  2. Run the model on all k versions.
  3. Report two numbers: accuracy (any version correct) and consistency (all versions produce the same answer).

High accuracy with low consistency is a warning sign — the model sometimes gets it right by luck of phrasing, sometimes not.
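
A minimal sketch of steps 2 and 3, assuming the paraphrases are already generated and model_answer(prompt) is a hypothetical wrapper around whatever model is under test:

def paraphrase_eval(test_cases, model_answer):
    """test_cases: list of {"paraphrases": [k prompt strings], "gold": expected answer}."""
    any_correct = all_agree = 0
    for case in test_cases:
        answers = [model_answer(p) for p in case["paraphrases"]]
        any_correct += any(a == case["gold"] for a in answers)   # accuracy: any version correct
        all_agree += len(set(answers)) == 1                      # consistency: all versions agree
    n = len(test_cases)
    return {"accuracy": any_correct / n, "consistency": all_agree / n}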

Fig 1 · Sensitivity persists across the model-size frontier. Score (%) by model scale, accuracy on a single phrasing vs consistency across paraphrases: roughly 60 / 30 at 1B, 75 / 45 at 7B, 85 / 65 at 70B, and 92 / 78 at the frontier. Larger models do better, but the problem is not solved.

§ 04 · MITIGATIONS AND THEIR LIMITS · What helps, what doesn’t

Practical mitigations:

  • Averaging or majority-voting over paraphrases at inference time (sketched below).
  • Structured prompts that pin down surface form.
  • RL or fine-tuning objectives that reward robustness to paraphrase.
  • Calibration, so that uncertainty is visible rather than hidden in an arbitrary flip.
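
A minimal sketch of the first of these, paraphrase ensembling at inference time; paraphrase(prompt, n) and model_answer(prompt) are hypothetical helpers, e.g. a second LLM call for the rewrites and the production model for the answers:

from collections import Counter

def robust_answer(prompt, paraphrase, model_answer, k=5):
    """Majority vote over the original prompt plus k-1 paraphrases of it."""
    variants = [prompt] + paraphrase(prompt, k - 1)   # paraphrase() returns a list of rewrites
    answers = [model_answer(v) for v in variants]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)               # vote share doubles as a rough confidence signal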

What doesn’t reliably help:

CHECK · Your LLM-based classifier scores 92% on a benchmark, but you notice it sometimes flip-flops on essentially identical inputs. How do you quantify the issue?

§ 05 · TAKING THIS FORWARD · Adjacent reading

The Coherence lesson covers a related but distinct property — consistency across an extended generation, rather than across paraphrased inputs. Both are facets of the broader robustness problem for production LLMs.

§ · GOING DEEPER · Prompt sensitivity is a real measurement problem

Sclar et al. (2023) and Mizrahi et al. (2023) documented a finding that should change how evals are read: the same capability measured with different but semantically equivalent prompts can shift by 10+ percentage points. Models that look like clear winners on one prompt template can be ties or losers on another. The field has been reporting single-prompt numbers and treating them as if they were robust capability measurements.

Two responses are emerging. Multi-prompt evaluation (run each test across 10+ paraphrases, report the distribution) gives a defensible estimate of mean performance with uncertainty bands. Order effects compound the problem: Lu et al. (2021) showed that the ordering of few-shot examples can swing results dramatically, and Pezeshkpour & Hruschka (2023) showed that models prefer certain positions among multiple-choice options even when instructed not to. The current best practice for an honest eval is multi-prompt + permuted option order + reported variance, not a single point estimate.
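
A sketch of that recipe under two assumptions: model_choice(prompt, options) is a hypothetical call that returns the option the model picks, and templates is a list of paraphrased prompt formats with a {question} slot. The point is the reporting, a mean with spread across templates rather than a single number.

import random
import statistics

def eval_mcq(items, templates, model_choice, n_orders=4, seed=0):
    """items: list of {"question", "options", "gold"}; needs at least two templates."""
    rng = random.Random(seed)
    scores = []
    for template in templates:
        correct = total = 0
        for item in items:
            for _ in range(n_orders):                 # sample several option orders per item
                options = item["options"][:]
                rng.shuffle(options)
                prompt = template.format(question=item["question"])
                correct += model_choice(prompt, options) == item["gold"]
                total += 1
        scores.append(correct / total)
    return {"mean": statistics.mean(scores), "stdev": statistics.stdev(scores)}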

§ · FURTHER READING · References & deeper sources

  1. Sclar et al. (2023). Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design · arXiv
  2. Mizrahi et al. (2023). State of What Art? A Call for Multi-Prompt LLM Evaluation · TACL
  3. Lu et al. (2021). Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity · ACL
  4. Pezeshkpour & Hruschka (2023). Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions · NAACL
  5. Polo et al. (2024). Efficient Multi-Prompt Evaluation of LLMs · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.