Latest Research · Module 30 · 7 min read

Overthinking

Reasoning models like o1 and R1 sometimes reason themselves into the wrong answer. More tokens, more chances to get lost. The fix is subtler than “think less.”

The five-bullet version

  • Modern reasoning models spend hundreds to thousands of tokens thinking before answering.
  • For hard problems this is a win. For easy ones, it can hurt — the model talks itself out of a correct first instinct.
  • Empirically, accuracy vs chain length traces an inverted U on many tasks: past the sweet spot, longer = worse.
  • The failure mode looks like: a confident wrong answer flips to a different wrong answer, then to an elaborate justification of it.
  • Mitigation: train the model to stop thinking when confident; budget chain length; provide hard early stops.

§ 00 · MORE THINKING, MORE WRONG? · The counterintuitive finding

When o1 and R1 first showed that longer chains of thought produce better answers on math and code, the conclusion seemed clean: more thinking = more right. The 2025 research literature has complicated this.

On hard problems, more thinking does help — up to a point. On problems the model would have gotten right in 50 tokens, asking for 2,000 tokens of reasoning sometimes makes it worse. The model has time to second-guess, introduce errors, talk itself into the wrong answer.

This is the overthinking problem: a documented failure mode in reasoning models in which a model given (or trained to use) excessive chain-of-thought length degrades on problems it would have solved correctly with shorter reasoning, because the chain introduces errors that compound. Length isn't free; it's a hyperparameter.

§ 01 · THE EMPIRICAL CURVE · Accuracy vs chain length, by task

Across recent reasoning-model benchmarks, the picture looks roughly like this:

[Figure: accuracy (y-axis) vs chain length in tokens (x-axis, 0–5,000), with separate curves for Easy, Medium, and Hard tasks.]
Fig 1 · Accuracy vs reasoning length. The right amount of thinking is task-dependent: reasoning models that always emit max-length chains pay a tax on easy problems.
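
If you want this curve for your own workload, the sweep is simple to sketch. A minimal sketch in Python, assuming a `generate(question, max_reasoning_tokens)` callable that caps the model's chain-of-thought; the function names and budget tiers here are illustrative placeholders, not any particular vendor's API:

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

def accuracy_at(
    budget: int,
    eval_set: Sequence[Tuple[str, str]],   # (question, gold answer) pairs
    generate: Callable[[str, int], str],   # (question, max_reasoning_tokens) -> answer
) -> float:
    """Fraction of the eval set answered correctly at a given thinking budget."""
    return mean(generate(q, budget) == gold for q, gold in eval_set)

def sweet_spot(
    eval_set: Sequence[Tuple[str, str]],
    generate: Callable[[str, int], str],
    budgets: Sequence[int] = (100, 500, 1000, 2500, 5000),
) -> int:
    """Sweep budgets and return the peak of the accuracy curve.

    Run this per difficulty bucket: expect the peak early for easy
    sets and late for hard ones, as in Fig 1.
    """
    curve = {b: accuracy_at(b, eval_set, generate) for b in budgets}
    return max(curve, key=curve.get)
```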

§ 02 · WHY THIS HAPPENS · Three contributing mechanisms

The failure traces above suggest three mechanisms:

  • Second-guessing: the chain gives the model room to revisit a correct first instinct and replace it.
  • Compounding errors: every extra reasoning step is another chance to inject a mistake, and later steps build on earlier ones.
  • Rationalization: once a wrong intermediate answer appears, further thinking tends to elaborate a justification for it rather than catch it.

§ 03 · CALIBRATING THINKING LENGTH · Stopping when confident

The mitigation isn’t “think less universally.” It’s “think as much as the problem warrants.” Three approaches in the literature:

  • Stop when confident: train (or prompt) the model to end the chain once its tentative answer stabilizes, rather than padding to a length target; a minimal inference-time version is sketched below.
  • Budget chain length: cap reasoning tokens per task and tune the cap like any other hyperparameter.
  • Hard early stops: enforce an inference-time ceiling so a runaway chain is truncated and forced to answer.
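
A minimal inference-time sketch of the first approach, assuming you can decode the chain step by step and score a tentative answer's stability; `next_step` and `confidence` are hypothetical hooks (a real system might use answer agreement across steps or a learned probe):

```python
from typing import Callable, List, Optional, Tuple

def think_until_confident(
    next_step: Callable[[List[str]], Tuple[str, Optional[str]]],
    # hypothetical: given the chain so far, returns (reasoning_step, tentative_answer or None)
    confidence: Callable[[List[str], str], float],
    # hypothetical: how sure are we that the tentative answer is final?
    threshold: float = 0.9,
    max_steps: int = 50,   # hard early stop: the runaway-chain ceiling
) -> Tuple[str, List[str]]:
    """Extend the chain one step at a time; stop as soon as the tentative
    answer looks stable, with max_steps as the hard backstop."""
    chain: List[str] = []
    answer = ""
    for _ in range(max_steps):
        step, tentative = next_step(chain)   # one more reasoning step
        chain.append(step)
        if tentative is not None:
            answer = tentative
            if confidence(chain, tentative) >= threshold:
                break                        # confident: stop thinking
    return answer, chain
```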

§ 04 · WHEN YOU ACTUALLY WANT MORE THOUGHT · Where the long chains earn their keep

Don’t infer from “easy problems get worse with longer chains” that you should always use short chains. Tasks where long thinking genuinely helps:

  • Hard math and competition-style problems, where no short direct pass reaches the answer.
  • Nontrivial code generation and debugging, the domains where o1 and R1 first showed the gains.
  • Problems where the model’s first instinct is a guess, so extra verification steps add signal rather than noise.

Your job in a production system: match the thinking-length budget to the task type. Send most queries with a short budget; reserve long thinking for the cases that need it. Tooling for this is still rough, and the next generation of reasoning models will likely auto-tune it. A routing sketch follows.
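
A sketch of that routing, with a hypothetical difficulty classifier and illustrative budget tiers; everything here is an assumption to be tuned on your own traffic, not a specific vendor API:

```python
from typing import Callable

# Illustrative tiers: easy queries skip chain-of-thought entirely, since
# forcing a long chain there is exactly where overthinking bites.
BUDGETS = {"easy": 0, "medium": 800, "hard": 4000}   # max reasoning tokens

def route(
    query: str,
    classify: Callable[[str], str],       # hypothetical: -> "easy" | "medium" | "hard"
    generate: Callable[[str, int], str],  # (query, max_reasoning_tokens) -> answer
) -> str:
    """Send each query to the model with the thinking budget its tier warrants."""
    tier = classify(query)
    return generate(query, BUDGETS.get(tier, BUDGETS["medium"]))
```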

CHECK · A QA team finds their reasoning model gets simple factual questions wrong 5% of the time, but only 1% of the time when forced to skip CoT entirely. What's likely happening?

§ 05 · TAKING THIS FORWARD · Practical guidance

Three practical moves for application teams:

  • Route by difficulty: default to short (or zero) thinking budgets and escalate only the queries that need long chains, as in the routing sketch above.
  • Treat chain length as a hyperparameter: tune a per-task cap against the accuracy curve, with a hard ceiling as a backstop.
  • Prefer several short chains to one long one: sample a few capped chains and majority-vote the answers (see Going Deeper below).

§ · GOING DEEPER · When extra thinking hurts and self-consistency helps

Chen et al. (2024), “Do Not Think That Much: O1-Like Models May Overthink,” documented the phenomenon: on easy problems, reasoning models generate long chains that introduce mistakes the direct-answer baseline avoids. Reasoning models are tuned to always think, and that’s wasteful and occasionally counterproductive on problems where the answer is immediate.

The two practical mitigations: self-consistency (Wang et al. 2022), sampling several short chains and majority-voting, often outperforms one long chain and lets you cap the token budget. And scaling test-time compute (Snell et al. 2024), explicitly allocating inference budget per problem difficulty with adaptive cutoffs, gets you the gains of long reasoning without paying for it on easy cases. Frontier models increasingly do this internally.
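
A minimal sketch of the self-consistency recipe, assuming a sampled `generate(query, max_reasoning_tokens)` call (temperature above zero, so chains differ between samples); the signature is a placeholder, not a specific provider API:

```python
from collections import Counter
from typing import Callable

def self_consistent_answer(
    query: str,
    generate: Callable[[str, int], str],  # sampled call: (query, max_reasoning_tokens) -> answer
    n_samples: int = 5,
    budget: int = 500,                    # short cap on reasoning tokens per chain
) -> str:
    """Sample several short, capped chains and majority-vote the final answers.

    Total spend is n_samples * budget, which you can keep below the cost
    of a single max-length chain.
    """
    votes = Counter(generate(query, budget) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```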

§ · FURTHER READING · References & deeper sources

  1. Wang et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models · ICLR
  2. Chen et al. (2024). Do Not Think That Much: O1-Like Models May Overthink · arXiv
  3. Madaan et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback · NeurIPS
  4. Snell et al. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters · arXiv
  5. Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models · NeurIPS

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.