Coherence
A model can write fluent sentences forever, but does the third paragraph still agree with the first? Coherence is the property of being internally consistent across an extended output — and one of the more stubborn deficiencies in long-form LLM generation.
The five-bullet version
- Coherence = internal consistency across an extended generation. Different from accuracy (truth) and sensitivity (consistency across inputs).
- Failure modes: contradicting earlier statements, character name drift, plot inconsistencies, code that uses two different variable names for the same thing.
- Mechanisms: the model attends over its own generated context. Early errors compound; the model commits to early choices that may be wrong.
- Hard to measure automatically — usually requires LLM-as-judge or human evaluators.
- Mitigations: structured outlines, explicit memory, RL on consistency, self-revision passes.
§ 00 · COHERENCE IS NOT ACCURACY
A separate axis of quality
Accuracy asks: is the answer correct? Coherence asks: does the answer agree with itself? A model can produce an accurate sentence and then immediately produce a second sentence that contradicts the first. Both can be individually plausible. Together they make a mess.
Coherence, the property of being internally consistent across an extended generation, is a separate axis from accuracy (correctness) and sensitivity (input variation): a coherent output doesn't contradict itself, and names, facts, and conclusions stay stable. It is especially visible in long-form tasks:
- Stories where character names drift halfway through.
- Code where the same function is referenced by two different names.
- Reports where the conclusion contradicts the data section.
- Legal documents where defined terms are used inconsistently.
§ 01 · WHERE LONG GENERATIONS BREAK
Common drift modes
Long generations break in predictable ways:
- Name drift. “Alice” in the opening becomes “Anna” in chapter 3.
- Fact drift. The model commits to a fact early (“the meeting is at 3pm”) and contradicts it later (“the 4pm meeting…”).
- Tone shift. Formal opening, casual middle, formal close. The model drifts toward the median voice during long generation.
- Conclusion non-sequitur. The bottom line doesn’t follow from what came before. A common failure on essay-length outputs.
- Structural disagreement. The outline promised X sections; the body produced Y.
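Some of these drift modes are shallow enough to catch with plain string checks before any model-based evaluation. The sketch below flags clock-time drift of the kind described above; the regex and the "more than one distinct time means drift" heuristic are illustrative toys, not a real consistency checker.

```python
import re

def time_mentions(text: str) -> set[str]:
    # Collect every clock-time mention ("3pm", "11 am", ...). If a draft
    # should contain exactly one meeting time, more than one distinct
    # value is a drift signal. Toy heuristic, not a real fact checker.
    return set(re.findall(r"\b\d{1,2}\s?[ap]m\b", text.lower()))

draft = "The meeting is at 3pm. Please circulate the agenda before the 4pm meeting."
mentions = time_mentions(draft)
if len(mentions) > 1:
    print("possible fact drift:", sorted(mentions))
```

In practice a check like this guards only one known fact type; a battery of such guards (names, dates, amounts) catches the cheap failures and leaves the semantic ones to an LLM judge.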
§ 02 · MECHANISMS BEHIND DRIFT
Why models forget what they just said
The autoregressive setup is part of the cause. The model attends over its own generated context — every later token sees everything that came before. So in principle, the model has full access to what it’s said. In practice, several things degrade:
- Attention dilution. As the context grows, each early token gets less attention per step. By token 5,000, the token at position 50 has stiff competition for attention weight.
- Lost in the middle. Information in the middle of the context is the first to be deprioritized, and in a long generation the model’s own earlier output becomes that middle.
- Compounding small errors. A minor inconsistency in paragraph 2 nudges generation in paragraph 3 further off-track. Errors don’t cancel; they accumulate.
- Surface-level vs structural commitments. The model tracks surface form (the next token) more reliably than abstract commitments (the name we chose for this character).
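The attention-dilution point can be made concrete with a back-of-the-envelope calculation: under a uniform-attention baseline, the average share available to any single earlier token shrinks as 1/n with context length. Real attention heads are peaked rather than uniform, so treat this purely as an intuition pump for how the per-token budget shrinks.

```python
# Uniform-attention baseline: at generation step n, the average share of
# attention mass any single earlier token can receive is 1/n. Real heads
# are peaked, not uniform; this only illustrates the shrinking budget.
for n in (50, 500, 5000):
    print(f"context length {n:>5}: average share per token = {1 / n:.4%}")
```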
§ 03 · HOW TO MEASURE COHERENCE
Automatic vs human
Unlike accuracy (correct vs not) or sensitivity (same answer across paraphrases), coherence is hard to automate. Most metrics in the literature use one of three approaches:
- LLM-as-judge. Show the generation to a different (usually stronger) model and ask it to rate internal consistency on a rubric. Cheap, fast, biased.
- Fact-extraction pipelines. Extract claims from the generation. Check for contradictions among the extracted claims. Hard to do reliably; most attempts have noisy signal.
- Human evaluation. Annotators rate consistency on rubrics. Slow and expensive, but the ground truth for now.
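The fact-extraction approach can be sketched with a deliberately naive extractor: pull "X is Y" claims and flag any subject bound to more than one value. Real pipelines use an LLM or an NLI model for both the extraction and the contradiction check; the regex below exists only to show the shape of the pipeline, and two different values are not always a genuine contradiction.

```python
import re
from collections import defaultdict

def extract_claims(text: str) -> dict[str, set[str]]:
    # Naive "Subject is (a|an|the)? value" extraction. Real systems use
    # an LLM or an IE model here; this regex only shows the pipeline shape.
    claims: dict[str, set[str]] = defaultdict(set)
    for subject, value in re.findall(r"\b([A-Z][a-z]+) is (?:a |an |the )?(\w+)", text):
        claims[subject].add(value.lower())
    return claims

def contradictions(text: str) -> dict[str, set[str]]:
    # A subject bound to two values is only *possibly* contradictory;
    # deciding for real needs semantics (being a doctor and tall is fine).
    return {s: vs for s, vs in extract_claims(text).items() if len(vs) > 1}

sample = "Alice is a doctor. Two chapters later, Alice is a lawyer. Bob is a teacher."
print(contradictions(sample))
```

The noisy-signal caveat above lives in exactly these two steps: the extractor misses claims the regex doesn't match, and the contradiction test fires on compatible attributes.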
§ 04 · PRACTICAL MITIGATIONS
What helps in production
- Outline first, then write. Have the model produce a structured outline, then write the document against it. The outline acts as an explicit memory the model refers back to.
- Chunked generation with explicit recap. For very long outputs, generate one section at a time, prepending a recap of established facts.
- Self-revision pass. After generating, ask the model to read its own output and flag inconsistencies. Then ask it to fix them. Two-pass writing is markedly more coherent.
- Structured state. Maintain a sidecar JSON of established facts (names, dates, numbers). Inject the JSON into every section’s prompt.
- Reasoning models for long-form. The explicit reasoning step acts as a planning layer that helps with structural coherence.
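The chunked-generation and structured-state mitigations combine naturally: every section call re-anchors the model on the sidecar of established facts instead of trusting distant context. A minimal sketch, assuming a hypothetical `generate` stand-in for your model client and illustrative field names in the state dict:

```python
import json

def generate(prompt: str) -> str:
    # Stand-in for a real model call; swap in your API client here.
    return f"[section written against {len(prompt)} chars of context]"

def write_section(instructions: str, state: dict) -> str:
    # Inject the sidecar of established facts into every section's prompt,
    # so each chunk is re-anchored rather than relying on distant context.
    recap = json.dumps(state, indent=2, sort_keys=True)
    prompt = (
        "Established facts (do not contradict these):\n"
        f"{recap}\n\n"
        f"Write the next section. {instructions}"
    )
    return generate(prompt)

state = {"characters": {"protagonist": "Alice"}, "facts": {"meeting_time": "3pm"}}
section = write_section("Cover the budget discussion.", state)
```

After each section, the state dict would be updated with any newly established facts before the next call, which is where the real engineering effort goes.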
§ 05 · TAKING THIS FORWARD
Adjacent failure modes
Coherence sits in a family of robustness properties along with sensitivity (input perturbations) and calibration (knowing what you don’t know). All three are active research areas; none is fully solved by current frontier models. Practical applications benefit most from building tooling and evals that surface these failures early.
§ · GOING DEEPER
Long-form generation, factuality, and structured outputs
Coherent long-form generation requires that the model maintain consistent facts, characters, and arguments across thousands of tokens. Long-form factuality benchmarks (Wei et al. 2024) measure how often atomic claims in a generated text are factually correct. The numbers are sobering — even frontier models get a meaningful fraction wrong, and the errors compound across paragraphs.
Three practical approaches work. Retrieval grounding: don’t ask the model to remember facts; retrieve and supply them. Structured generation (function-call schemas, JSON mode, grammar constraints) forces consistency within each output. Iterative outline-then-write (Yang et al. 2022, Re3): generate a structural outline first, then fill in each section in a separate pass with the outline in context. The lost-in-the-middle problem (Liu et al. 2023) is the root cause of much long-form drift; mitigating it directly is most of the battle.
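The outline-then-write loop can be sketched in a few lines. `generate` below is a stub for a real model call; the key point is that every section call carries the full outline, so the structural commitments never fall out of the effective attention window.

```python
def generate(prompt: str) -> str:
    # Stub for a real model call; replace with your API client.
    return f"<draft of: {prompt.splitlines()[-1]}>"

def outline_then_write(topic: str, sections: list[str]) -> str:
    # Pass 1: the outline is fixed up front. Pass 2: each section is
    # written in its own call with the whole outline in context, so the
    # structural plan is always close at hand rather than mid-context.
    outline = "\n".join(f"{i}. {title}" for i, title in enumerate(sections, 1))
    drafts = []
    for title in sections:
        prompt = (
            f"Topic: {topic}\nOutline:\n{outline}\n"
            f"Write only the section titled: {title}"
        )
        drafts.append(generate(prompt))
    return "\n\n".join(drafts)

doc = outline_then_write("coherence", ["Intro", "Mechanisms", "Mitigations"])
```

Re3-style systems add a revision pass on top of this loop; the two-pass structure is the part that transfers directly to production prompting.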
§ · FURTHER READING
References & deeper sources
- Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts · TACL
- Wei et al. (2024). Long-form factuality in large language models · arXiv
- Min et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation · EMNLP
- Yang et al. (2022). Re3: Generating Longer Stories With Recursive Reprompting and Revision · EMNLP
- Chen et al. (2023). Extending Context Window of Large Language Models via Position Interpolation · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.