Overthinking
Reasoning models like o1 and R1 sometimes reason themselves into the wrong answer. More tokens, more chances to get lost. The fix is subtler than “think less.”
The five-bullet version
- Modern reasoning models spend hundreds to thousands of tokens thinking before answering.
- For hard problems this is a win. For easy ones, it can hurt — the model talks itself out of a correct first instinct.
- Empirically: accuracy vs chain length traces an inverted U on many tasks. Past the sweet spot, longer = worse.
- The failure mode in a trace: a confident answer flips to a different wrong answer, which then gets an elaborate justification.
- Mitigation: train the model to stop thinking when confident; budget chain length; provide hard early stops.
§ 00 · MORE THINKING, MORE WRONG?
The counterintuitive finding
When o1 and R1 first showed that longer chains of thought produce better answers on math and code, the conclusion seemed clean: more thinking = more right. The 2025 research literature has complicated this.
On hard problems, more thinking does help — up to a point. On problems the model would have gotten right in 50 tokens, asking for 2,000 tokens of reasoning sometimes makes it worse. The model has time to second-guess, introduce errors, talk itself into the wrong answer.
This is the overthinking problem: a documented failure mode in reasoning models where, given (or trained to use) excessive chain-of-thought length, the model degrades on problems it would have solved correctly with shorter reasoning. The chain introduces errors that compound. Length isn’t free; it’s a hyperparameter.
§ 01 · THE EMPIRICAL CURVE
Accuracy vs chain length, by task
Across recent reasoning-model benchmarks, the picture looks roughly:
- Easy tasks (arithmetic with small numbers, simple fact recall): accuracy peaks at short chain length. Beyond ~50–200 tokens of CoT, accuracy drops.
- Medium tasks (basic logic, multi-step word problems): accuracy improves with chain length up to a few hundred tokens, then plateaus.
- Hard tasks (olympiad-style math, complex code): accuracy keeps improving with chain length to thousands or tens of thousands of tokens. The frontier on the hardest benchmarks is set by long chains.
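If you want to see where your own workload sits on this curve, sweep the thinking budget and measure accuracy at each point. A minimal sketch; `ask(question, budget)` is a hypothetical helper standing in for whatever call returns your model's answer under a given thinking-token budget:

```python
from typing import Callable

def accuracy_vs_budget(
    eval_set: list[tuple[str, str]],       # (question, gold answer) pairs
    ask: Callable[[str, int], str],        # hypothetical: answer under a thinking budget
    budgets: tuple[int, ...] = (50, 200, 800, 3200),
) -> dict[int, float]:
    """Measure accuracy at each thinking-token budget to locate the sweet spot."""
    scores: dict[int, float] = {}
    for budget in budgets:
        hits = sum(ask(q, budget).strip() == gold.strip() for q, gold in eval_set)
        scores[budget] = hits / len(eval_set)
    return scores
```

On easy tasks, expect the resulting curve to peak early and sag; on hard tasks it should keep climbing across these budgets.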
§ 02 · WHY THIS HAPPENS
Three contributing mechanisms
- Compounding error. Each generated token can introduce a small mistake; long chains accumulate them (toy arithmetic after this list).
- Self-doubt. Trained to consider alternatives, models sometimes argue themselves out of a correct first answer. You’ve seen this in humans too: the overthinking-on-a-test phenomenon.
- Distribution shift. Mid-chain, the model is continuing text it generated itself, which may drift from the training distribution. The further into the chain, the more model-generated context dominates, and the higher the risk of weird states.
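To make the compounding-error mechanism concrete: if each reasoning token independently had a small probability ε of introducing an unrecovered mistake, a chain of n tokens stays clean with probability (1 - ε)^n, which decays geometrically. The independence assumption is ours, for illustration only:

```python
# Toy model: each reasoning token independently has probability eps of a
# fatal, unrecovered mistake, so P(chain stays clean) = (1 - eps) ** n.
# Real errors are correlated and often recoverable; this is illustration only.
for eps in (0.0005, 0.002):
    for n in (50, 500, 5000):
        print(f"eps={eps}, n={n:>4}: P(clean) = {(1 - eps) ** n:.4f}")
```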
§ 03 · CALIBRATING THINKING LENGTH
Stopping when confident
The mitigation isn’t “think less universally.” It’s “think as much as the problem warrants.” Three approaches in the literature:
- Confidence-gated stopping. Train the model to predict its own confidence at intermediate steps. Stop when confidence passes a threshold.
- Length budgets per task. Classify the incoming problem (easy/medium/hard) and budget chain length accordingly. Either a separate classifier or a learned signal from the model itself.
- Self-consistency over short chains. Instead of one long chain, sample several short chains and majority-vote. Often beats one long chain at the same token cost; a minimal sketch follows this list.
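Of the three, self-consistency is the simplest to wire up. A minimal sketch, assuming a hypothetical `sample_chain(question, max_tokens)` that returns one short chain's final answer; any provider's sampling endpoint slots in there:

```python
from collections import Counter
from typing import Callable

def self_consistency(
    question: str,
    sample_chain: Callable[[str, int], str],  # hypothetical: one short chain's final answer
    k: int = 8,
    max_tokens: int = 300,                    # keep each chain deliberately short
) -> str:
    """Sample k short chains and return the majority answer."""
    answers = [sample_chain(question, max_tokens) for _ in range(k)]
    return Counter(a.strip() for a in answers).most_common(1)[0][0]
```

Eight 300-token chains cost about the same as one 2,400-token chain, but a single mid-chain error now costs one vote rather than the whole answer.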
§ 04 · WHEN YOU ACTUALLY WANT MORE THOUGHT
Where the long chains earn their keep
Don’t infer from “easy problems get worse with longer chains” that you should always use short chains. Tasks where long thinking genuinely helps:
- Olympiad math / IMO-style problems. Each step requires real reasoning; you can’t skip ahead.
- Complex code with edge cases. The model needs to enumerate cases, write tests in its head, debug its own intermediate code.
- Multi-step planning. When the answer is a plan, not a fact.
- Adversarial / trick questions. Where the obvious answer is wrong and the model needs to notice.
The user’s job (in a production system): match the thinking-length budget to the task type. Send most queries with a short budget; reserve long thinking for cases that need it. Tooling for this is still rough — the next generation of reasoning models will likely auto-tune.
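Until that auto-tuning arrives, a crude router is easy to stand up. A sketch in which every name is hypothetical: `classify_difficulty` might be a small classifier model or keyword heuristics, and `ask` is your model call with a thinking-budget parameter:

```python
from typing import Callable

# Illustrative numbers; tune against your own evals.
BUDGETS = {"easy": 0, "medium": 500, "hard": 8000}

def route(question: str,
          classify_difficulty: Callable[[str], str],  # stand-in for your classifier
          ask: Callable[..., str]) -> str:            # stand-in for your model call
    tier = classify_difficulty(question)              # -> "easy" | "medium" | "hard"
    return ask(question, thinking_budget=BUDGETS.get(tier, 500))
```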
§ 05 · TAKING THIS FORWARD
Practical guidance
Three practical moves for application teams:
- Test your prompts on a reasoning model with and without CoT enabled (or with very short vs long budgets). If short is consistently better, route around the long-thinking variant.
- On simple, common queries, use a smaller non-reasoning model. Save the reasoning model for the cases where it earns its cost.
- If your provider exposes a thinking-budget parameter (Anthropic does, OpenAI does for some o-models), tune it task-by-task in your evals; a call sketch follows this list.
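For concreteness, here is roughly what the budget knob looks like with Anthropic's extended-thinking parameter on the Messages API. The model string is an assumption; check the current docs for model names and minimum budgets:

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumption: substitute a current thinking-capable model
    max_tokens=2000,                   # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 1024},  # cap on thinking tokens
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
)
print(response.content[-1].text)       # final text block follows the thinking block
```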
§ · GOING DEEPER
When extra thinking hurts and self-consistency helps
Chen et al. (2024), “Do Not Think That Much: O1-Like Models May Overthink”, documented the phenomenon: on easy problems, reasoning models generate long chains that introduce mistakes the direct-answer baseline avoids. Reasoning models are tuned to always think, and that’s wasteful and occasionally counterproductive on problems where the answer is immediate.
Two practical mitigations stand out. Self-consistency (Wang et al. 2022): sample several short chains and majority-vote. It often outperforms one long chain and lets you cap the token budget. And scaling test-time compute (Snell et al. 2024): explicitly allocate inference budget per problem difficulty, with adaptive cutoffs, to get the gains of long reasoning without paying for it on easy cases. Frontier models increasingly do this internally.
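The two combine naturally: sample short chains one at a time and stop as soon as the leading answer can no longer be overtaken, which is one simple form of adaptive cutoff. A sketch, again with a hypothetical `sample_chain`:

```python
from collections import Counter
from typing import Callable

def adaptive_vote(question: str,
                  sample_chain: Callable[[str, int], str],  # hypothetical sampler
                  k_max: int = 8,
                  max_tokens: int = 300) -> str:
    """Sequential self-consistency: stop sampling once the lead is unbeatable."""
    votes: Counter[str] = Counter()
    for drawn in range(1, k_max + 1):
        votes[sample_chain(question, max_tokens).strip()] += 1
        (leader, lead), *rest = votes.most_common(2)
        runner_up = rest[0][1] if rest else 0
        if lead - runner_up > k_max - drawn:  # no later answer can catch up
            return leader                     # easy problems exit after a few draws
    return votes.most_common(1)[0][0]
```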
§ · FURTHER READING
References & deeper sources
- Wang et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models · ICLR
- Chen et al. (2024). Do Not Think That Much: O1-Like Models May Overthink · arXiv
- Madaan et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback · NeurIPS
- Snell et al. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters · arXiv
- Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models · NeurIPS
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.