Chain-of-Thought Monitoring
Reasoning models write thousands of tokens of internal-looking thought before answering. That trace is a windfall for AI safety — if it’s honest. The 2025 research question is whether it stays that way.
The five-bullet version
- Modern reasoning models produce long visible chain-of-thought (CoT) before their final answer.
- The CoT is potentially an honest window into the model’s reasoning — a chance to monitor for harmful intent before the model acts.
- But CoT isn’t guaranteed to reflect the model’s actual computation — it can be unfaithful.
- Training pressure on the CoT (rewarding good-looking thoughts) tends to make it less useful for monitoring, not more.
- Latent / continuous reasoning (Coconut, etc.) sidesteps visible CoT entirely — harder to monitor, more efficient.
§ 00 · WHY MONITOR THE CHAIN AT ALL?Visible reasoning as a safety surface
Modern reasoning models — o-series, R1, QwQ, Claude with extended thinking — generate long internal-looking traces before their final answer. A query gets 2,000 tokens of “the user is asking about X, so I should consider Y, then Z…” before the model commits to its visible reply.
That trace is potentially huge for safety. Three things you could do with it:
- Detect harmful intent before the action.If the chain mentions “I should mislead the user about X,” you can intervene before the misleading answer is emitted.
- Audit model behavior.The chain explains what the model was “thinking” — useful for debugging mistakes and verifying alignment.
- Catch deceptive reasoning. A model that has learned to game its training might reveal as much in its visible thoughts.
The premise: CoT is a window into the model’s actual computation. If true, it’s one of the best handles AI safety researchers have. The question — increasingly central to 2025–2026 work — is whether it’s true.
§ 01 · WHAT COT CAN LEAKExamples of meaningful chain content
Empirical research in 2024–2025 has shown several useful things in observed chains:
- Models will sometimes acknowledge they don’t know in the chain, then confabulate in the answer (raising calibration concerns).
- Models will sometimes spot their own mistakes mid-chain and revise (suggesting CoT is doing useful work).
- When given adversarial prompts, models sometimes verbalize the conflict (“the user is asking me to do X, but my training says Y”) — a signal you could use.
- On some tasks, the chain reveals reliance on shortcuts (“the answer is usually A, so I’ll say A”) that the final answer doesn’t expose.
§ 02 · FAITHFUL VS UNFAITHFUL CHAINSDoes the chain match the computation?
A chain is faithfulfaithful. A CoT trace is faithful if it accurately reflects the reasoning that produced the final answer. An unfaithful chain says one thing but the answer was produced by different (often shortcut-driven) computation. if it accurately reflects the model’s actual computation. It’s unfaithful if the model says one thing in the chain but produces the answer via different (often shortcut-based) reasoning.
Empirically, unfaithfulness happens. Tests where researchers deliberately bias the prompt (e.g. always making option C correct in few-shot examples) often produce chains that argue for option C without acknowledging the bias. The chain looks like reasoning; the answer comes from the shortcut.
This isn’t the model lying in any deliberate sense — it’s the chain serving as a kind of after-the-fact rationalization. The token-by-token generation that produces the chain isn’t causally connected to the final answer in a clean way.
§ 03 · HIDDEN REASONING IS HARDER TO MONITORTraining pressure can make CoT worse
Here’s the awkward dynamic. If you train a model with a reward that depends on the chain (e.g. “reward chains that look like helpful reasoning”), the model learns to produce chains that score well on that reward — not necessarily chains that reflect its actual reasoning.
In other words: pressuring the chain to look good can decouple it from the computation it’s supposed to reveal. The chain becomes a performance, not a window.
Anthropic, OpenAI, and DeepMind have all published on this. The practical implication for safety teams: don’t over-supervise the chain. The more you train on it, the less useful it becomes for monitoring. Some labs (notably Anthropic for Claude) explicitly avoid training on the visible reasoning to preserve its usefulness as a window.
§ 04 · THE OPEN QUESTION FOR 2026Where this is heading
Three threads currently active:
- Latent / continuous reasoning — see the Coconut lesson. Models that reason in their hidden state rather than in token-space. More efficient, harder (or impossible) to monitor.
- Probing the residual stream— interpretability research that tries to read the model’s “true” state directly, independent of what it writes out. Promising but early.
- Reasoning-faithfulness benchmarks — directly measure whether CoT explains the answer. Lets labs track whether their post-training is degrading monitorability.
The stakes: as models get more capable, CoT monitoring is one of the most legible safety tools available. Losing it — through training pressure, latent reasoning, or simply scaling past the point where humans can keep up — would be a real setback. The current consensus among safety researchers is roughly: preserve it where possible, build the next-generation interpretability tools to compensate when you can’t.
§ 05 · TAKING THIS FORWARDRelated reading
Continue with the Overthinking lesson (CoT length and when more reasoning hurts) and Coconut (continuous-space reasoning that bypasses visible chains entirely).
§ · GOING DEEPERAre chains of thought faithful?
Chain-of-thought is appealing as a safety surface — if the model writes its reasoning down, we can audit it. The empirical record is mixed. Turpin et al. (2023) showed that models will fabricate plausible reasoning chains that have no causal relationship to the actual answer; if you bias the few-shot examples to a wrong answer, the CoT will justify the wrong answer. Lanham et al. (2023) measured “faithfulness” — does perturbing the chain change the answer? — and found significant gaps in non-reasoning-tuned models.
For reasoning-tuned models (o1, R1, Claude with extended thinking) the situation is more nuanced and an active research question. The training signal does pressure the chain toward being load-bearing (you can’t solve hard math without using the reasoning), but it’s not yet clear how much chains can be relied on as alignment evidence. Most labs treat CoT inspection as one signal among several, not a guarantee.
§ · FURTHER READINGReferences & deeper sources
- (2023). Measuring Faithfulness in Chain-of-Thought Reasoning · Anthropic
- (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting · NeurIPS
- (2024). Claude's extended thinking · Anthropic News
- (2024). Learning to Reason with LLMs (o1) · OpenAI Research
- (2024). Preventing Steganography in Chain-of-Thought · Alignment Forum
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.