Coconut
Chain-of-thought reasons in words. Meta’s Coconut proposes a provocative alternative: reason in vectors. Skip the token bottleneck and let the model think in its native representation.
The five-bullet version
- Chain-of-thought forces the model to express each reasoning step as text. That’s a constraint, not a feature.
- Coconut (Chain of Continuous Thought) lets the model produce a hidden “thought” vector at each step, fed back as input — no token emitted.
- The model reasons in its native high-dimensional space, then emits tokens only at the end.
- Faster (no per-step decoding), denser (each step carries more information), but harder to monitor.
- Shows competitive results on reasoning tasks while using fewer “thinking” steps than text CoT.
§ 00 · TOKENS AS A REASONING BOTTLENECK
Why language might be the wrong medium for thought
Chain-of-thought has the model write out its reasoning as tokens. Each step is a sentence or two of human-readable text. The model gains compute by generating those tokens, because each new token gets its own forward pass.
Three things this approach costs:
- Information loss per step. The model’s hidden state is high-dimensional (thousands of floats). Compressing each reasoning step into 5–20 tokens loses information, and the model has to re-encode at the next step (see the back-of-envelope sketch after this list).
- Speed. Each token requires a full transformer forward pass. Long chains take real wall-clock time.
- Linguistic constraint. Reasoning has to fit into natural-language patterns. Some computations don’t have a clean linguistic expression.
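To make the first cost concrete, here is a back-of-envelope sketch comparing how much information one emitted token can carry with the raw capacity of one hidden-state vector. The vocabulary size, hidden width, and precision are typical illustrative values, not numbers from the Coconut paper.

```python
# Back-of-envelope comparison: information per emitted token vs. raw capacity
# of one hidden-state vector. All numbers are illustrative assumptions.
import math

vocab_size = 50_257      # e.g. a GPT-2-sized BPE vocabulary
d_model = 4_096          # hidden width of a mid-sized LLM (assumed)
bits_per_float = 16      # bf16 activations (assumed)

bits_per_token = math.log2(vocab_size)        # ~15.6 bits, an upper bound per token
bits_per_thought = d_model * bits_per_float   # ~65,536 bits of raw capacity

print(f"one token         : <= {bits_per_token:.1f} bits")
print(f"one thought vector: ~{bits_per_thought:,} bits (raw capacity, not all usable)")
```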
§ 01 · REASONING IN CONTINUOUS SPACE
The Coconut idea
Coconut — Chain of Continuous Thought — proposes the alternative: skip the tokenization step in the middle of reasoning. Instead of generating a token at each step, the model produces a thought vector — a continuous-valued embedding — that’s fed back as the next input.
The model emits an actual token only at the end, when it has the answer. Between the input and that answer, it does internal computation that never surfaces as text.
§ 02 · COCONUT, SPECIFICALLY
Implementation in one paragraph
The training recipe: start with a standard pretrained LLM. During fine-tuning, introduce a special “thinking” mode where the model’s last hidden-state vector (the one that would normally pass through the lm_head to produce a token) is fed directly back as the next input embedding — bypassing the unembedding and re-embedding. Train the model to use this mode for reasoning steps and switch to normal token output for the final answer.
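A minimal sketch of the wiring at inference time, assuming a Hugging Face causal LM (gpt2 is used here only because its hidden width matches its embedding width). It shows the mechanism, not the paper’s training recipe: an off-the-shelf model hasn’t been fine-tuned to use latent steps, so the loop below demonstrates the plumbing rather than useful reasoning. N_THOUGHTS and MAX_ANSWER_TOKENS are hypothetical knobs, not the paper’s settings.

```python
# Sketch of Coconut-style continuous-thought decoding (illustrative, not Meta's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"        # any causal LM whose hidden size equals its embedding size
N_THOUGHTS = 4             # number of latent "thinking" steps (assumed hyperparameter)
MAX_ANSWER_TOKENS = 20

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

prompt = "Q: 3 apples plus 4 apples is how many apples? A:"
input_ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(input_ids)             # (1, T, d_model)

with torch.no_grad():
    # Latent phase: feed the last hidden state straight back as the next input
    # embedding, bypassing lm_head and re-embedding. No token is emitted.
    for _ in range(N_THOUGHTS):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]           # (1, 1, d_model)
        embeds = torch.cat([embeds, thought], dim=1)

    # Answer phase: switch back to ordinary greedy token decoding.
    answer_ids = []
    for _ in range(MAX_ANSWER_TOKENS):
        out = model(inputs_embeds=embeds)
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        answer_ids.append(next_id.item())
        embeds = torch.cat([embeds, model.get_input_embeddings()(next_id)], dim=1)

print(tok.decode(answer_ids))
```

In the paper, the latent steps are learned during fine-tuning (the model is trained to use them in place of text reasoning steps and to switch back to tokens for the answer); the sketch only reproduces the forward-pass mechanics.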
Empirically: on math word problems and logical reasoning, Coconut matches CoT-trained baselines with fewer “thinking” steps. The hidden-state thoughts are denser than text tokens, so each step does more work.
§ 03 · SPEED, CAPACITY, OPACITY
The trade-off triangle
Continuous reasoning is faster: no per-step decoding, no re-encoding of generated tokens. It also has bigger effective per-step capacity: thousands of float dimensions instead of one token.
The cost is opacity. Hidden-state vectors are not interpretable. You can’t read a thought vector and know what the model was reasoning about. For safety teams hoping to use chain-of-thought monitoring, Coconut-style models are a step backwards — there’s nothing to monitor.
This is the tension at the heart of 2025’s reasoning-model debates. Continuous reasoning is probably better engineering. Visible reasoning is probably better safety. The two pull in opposite directions; the field hasn’t settled.
§ 04 · WHERE THIS IS HEADING
Open questions
Three open questions as of 2026:
- Does continuous reasoning scale? Coconut showed promise on small reasoning tasks. The gains haven’t (yet publicly) been replicated at frontier-model scale.
- What does continuous reasoning even compute? Interpretability researchers are working on probes that read thought vectors (a minimal probe sketch follows this list). Early work suggests recognizable patterns (numbers, entities) live there — but the picture is incomplete.
- Will it ship? Frontier labs are conservative about reasoning approaches that lose monitoring. Coconut-style methods may stay academic until safety tooling catches up.
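For a sense of what “reading” a thought vector might involve, here is a minimal linear-probe sketch. It uses synthetic stand-in vectors and labels rather than vectors captured from a real model, and it is not any particular paper’s method; the point is only the shape of the experiment: collect thought vectors, attach labels for some property, and check whether a simple classifier can recover that property.

```python
# Minimal linear-probe sketch over (synthetic) thought vectors. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 768, 2_000

# Stand-in data: a real probe would use vectors captured during latent decoding,
# with labels derived from the task (e.g. "does this step involve an even number?").
X = rng.normal(size=(n, d_model))
w_true = rng.normal(size=d_model)       # pretend direction encoding the property
y = (X @ w_true > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```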
§ 05 · TAKING THIS FORWARD
Adjacent reading
The Chain-of-Thought Monitoring lesson covers the safety side of the trade-off; SFT vs RL covers the post-training mechanics that produced the current reasoning-model wave. The continuous-reasoning thread is the most architecturally ambitious branch of the larger reasoning-model story.
§ · GOING DEEPER
Reasoning without tokens
Coconut (Hao et al. 2024 — “Chain of Continuous Thought”) is the experimental proposal that LLM reasoning needn’t be discretized into tokens. Instead of decoding the model’s thinking into text and re-embedding it, feed the last hidden state directly back into the model as the next “thought” — keep reasoning in continuous latent space, only emit tokens at the final answer.
The motivation is twofold. Speed: skipping the embed-then-decode round-trip every step is faster. Capacity: a hidden state carries more information than a single token, so each reasoning step does more work. The trade-off is interpretability — you can’t read continuous thoughts the way you can read text. Pfau et al.’s “Let’s Think Dot by Dot” (2024) and Geiping et al.’s latent-reasoning work (2025) are nearby; the field hasn’t yet settled on whether latent reasoning will compose to frontier-level capabilities.
§ · FURTHER READING
References & deeper sources
- Hao et al. (2024). Training Large Language Models to Reason in a Continuous Latent Space (Coconut) · arXiv
- Goyal et al. (2024). Think before you speak: Training Language Models With Pause Tokens · ICLR
- Pfau et al. (2024). Let's Think Dot by Dot: Hidden Computation in Transformer Language Models · COLM
- Geiping et al. (2025). Scaling up Test-Time Compute with Latent Reasoning · arXiv
- Wei et al. (2022). Chain-of-Thought Prompting (the discrete-reasoning baseline) · NeurIPS
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.