Coherence

A model can write fluent sentences forever, but does the third paragraph still agree with the first? Coherence is the property of being internally consistent across an extended output — and one of the more stubborn deficiencies in long-form LLM generation.

The five-bullet version

  • Coherence = internal consistency across an extended generation. Different from accuracy (truth) and sensitivity (consistency across inputs).
  • Failure modes: contradicting earlier statements, character name drift, plot inconsistencies, code that uses two different variable names for the same thing.
  • Mechanisms: the model attends over its own generated context. Early errors compound; the model commits to early choices that may be wrong.
  • Hard to measure automatically — usually requires LLM-as-judge or human evaluators.
  • Mitigations: structured outlines, explicit memory, RL on consistency, self-revision passes.

§ 00 · COHERENCE IS NOT ACCURACY · A separate axis of quality

Accuracy asks: is the answer correct? Coherence asks: does the answer agree with itself? A model can produce an accurate sentence and then immediately produce a second sentence that contradicts the first. Both can be individually plausible. Together they make a mess.

Coherence (a separate axis from accuracy, which measures correctness, and sensitivity, which measures stability under input variation) is especially visible in long-form tasks: a coherent output doesn't contradict itself, and its names, facts, and conclusions stay stable as the text grows.

§ 01 · WHERE LONG GENERATIONS BREAK · Common drift modes

Long generations break in predictable ways:

  • Contradiction: a later paragraph asserts the opposite of an earlier one.
  • Name drift: a character introduced under one name reappears under a near-variant (see the sketch below).
  • Plot inconsistency: events, timelines, or details stop lining up.
  • Code drift: the same quantity ends up with two different variable names.
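
Name drift in particular is easy to screen for mechanically. The sketch below is a minimal heuristic of my own, not anything from the papers cited here: it collects capitalized tokens that don't start a sentence and flags pairs that differ by only a character or two, which is what a mid-story rename usually looks like.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

def find_name_drift(text: str, threshold: float = 0.85) -> list[tuple[str, str]]:
    """Flag pairs of near-identical proper nouns, e.g. a renamed character."""
    names = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Skip the first word: sentence-initial capitals aren't evidence of a name.
        for word in sentence.split()[1:]:
            word = word.strip(".,;:!?\"'")
            if re.fullmatch(r"[A-Z][a-z]{2,}", word):
                names.add(word)
    return [
        (a, b)
        for a, b in combinations(sorted(names), 2)
        if SequenceMatcher(None, a, b).ratio() >= threshold
    ]

story = "She greeted Katherine at the door. Hours later she watched Katharine leave."
print(find_name_drift(story))  # [('Katharine', 'Katherine')]
```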

§ 02 · MECHANISMS BEHIND DRIFT · Why models forget what they just said

The autoregressive setup is part of the cause. The model attends over its own generated context — every later token sees everything that came before. So in principle, the model has full access to what it's said. In practice, several things degrade: attention to the middle of a long context weakens (the lost-in-the-middle effect, Liu et al. 2023), early errors compound because every later token conditions on them, and the model tends to commit to early choices, writing around a mistake rather than correcting it.
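
A structural sketch makes the compounding concrete. This is not any real API; `next_token` is a hypothetical stand-in for a model call. The point is the loop shape: each step appends to a context that is never revisited, so a wrong early commitment sits in the conditioning for every subsequent token.

```python
def generate(prompt: str, next_token, max_tokens: int = 256) -> str:
    """Schematic autoregressive decoding loop."""
    context = prompt
    for _ in range(max_tokens):
        token = next_token(context)  # conditions on prompt + all prior output
        if token is None:            # the model signals end-of-sequence
            break
        context += token             # the choice is now permanent context
    return context[len(prompt):]
```

Nothing in this loop rereads or repairs earlier output; that missing revision step is exactly what the mitigations in § 04 bolt on.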

§ 03 · HOW TO MEASURE COHERENCE · Automatic vs human

Unlike accuracy (correct vs not) or sensitivity (same answer across paraphrases), coherence is hard to automate. Most metrics in the literature use one of three approaches: entailment-based checks that test sentence or paragraph pairs for contradiction, question-answering checks that probe whether the text gives stable answers, and LLM-as-judge scoring; careful human evaluation remains the fallback when the stakes are high.
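
As a concrete sketch of the LLM-as-judge approach, assuming a hypothetical `ask_judge` callable that wraps whatever judge model you use and treating paragraphs as the unit of comparison, one workable recipe is to test paragraph pairs for contradiction and report the consistent fraction:

```python
from itertools import combinations

JUDGE_PROMPT = (
    "Do these two passages from the same document contradict each other?\n"
    "Answer with exactly one word: CONTRADICT or CONSISTENT.\n\n"
    "Passage A: {a}\n\nPassage B: {b}"
)

def coherence_score(paragraphs: list[str], ask_judge) -> float:
    """Fraction of paragraph pairs the judge labels mutually consistent."""
    pairs = list(combinations(paragraphs, 2))
    if not pairs:
        return 1.0  # fewer than two paragraphs: nothing can contradict
    consistent = sum(
        "CONTRADICT" not in ask_judge(JUDGE_PROMPT.format(a=a, b=b)).upper()
        for a, b in pairs
    )
    return consistent / len(pairs)
```

The pairwise loop costs O(n²) judge calls, which is why this usually runs as an offline eval rather than a production gate.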

Fig 1 · Coherence vs generation length (0–8,000 tokens) for a reasoning model, a frontier chat model, and a small model. Long generations are hard. Reasoning models seem to do better — possibly because their explicit reasoning anchors the output.

§ 04 · PRACTICAL MITIGATIONS · What helps in production

Four mitigations pull the most weight in production: structured outlines that fix the document's skeleton before any prose is written, explicit memory (a running fact sheet fed back into context), RL objectives that reward consistency, and self-revision passes that reread the draft for contradictions before returning it. A sketch of the last follows the check below.

CHECK · You're using an LLM to generate 5,000-word case studies. Readers report the conclusion sometimes contradicts the body. Best fix?
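
The self-revision mitigation maps directly onto this scenario. The sketch below assumes only a hypothetical `llm` callable that takes a prompt and returns text; it is one way to wire a revision pass, not the canonical one.

```python
REVISE_PROMPT = (
    "Below is a draft. Rewrite it so the conclusion is consistent with the "
    "body, changing as little as possible. Return the full revised draft.\n\n"
    "Draft:\n{draft}"
)

def self_revise(draft: str, llm, max_passes: int = 2) -> str:
    """Run revision passes until the draft stops changing or passes run out."""
    for _ in range(max_passes):
        revised = llm(REVISE_PROMPT.format(draft=draft))
        if revised.strip() == draft.strip():
            break  # fixed point: the model found nothing left to reconcile
        draft = revised
    return draft
```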

§ 05 · TAKING THIS FORWARD · Adjacent failure modes

Coherence sits in a family of robustness properties along with sensitivity (input perturbations) and calibration (knowing what you don’t know). All three are active research areas; none is fully solved by current frontier models. Practical applications benefit most from building tooling and evals that surface these failures early.

§ · GOING DEEPER · Long-form generation, factuality, and structured outputs

Coherent long-form generation requires that the model maintain consistent facts, characters, and arguments across thousands of tokens. Long-form factuality benchmarks (Bohnet et al. 2024) measure how often atomic claims in a generated text are factually correct. The numbers are sobering — even frontier models get a meaningful fraction wrong, and the errors compound across paragraphs.
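
The core operation behind these benchmarks is simple to state. In the style of FActScore (Min et al. 2023, listed below), factual precision is the supported fraction of atomic claims. In this sketch, `extract_claims` and `verify_claim` are hypothetical stand-ins; in the papers, each is itself a model or retrieval call.

```python
def atomic_factuality(text: str, extract_claims, verify_claim) -> float:
    """FActScore-style precision: fraction of atomic claims judged supported."""
    claims = extract_claims(text)  # minimal standalone statements
    if not claims:
        return 1.0
    return sum(bool(verify_claim(c)) for c in claims) / len(claims)
```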

Three practical approaches that work:

  • Retrieval grounding: don't ask the model to remember facts; give them to it in context.
  • Structured generation (function-call schemas, JSON mode, grammar constraints): forces consistency within each output.
  • Iterative outline-then-write (Yang et al. 2022, Re3): generate a structural outline first, then fill in each section in a separate pass with the outline as context (sketched below).

The lost-in-the-middle problem (Liu et al. 2023) is the root cause of much long-form drift; mitigating it directly is most of the battle.
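
A minimal sketch of the outline-then-write pattern, in the spirit of Re3 rather than a reproduction of its pipeline (`llm` is again a hypothetical prompt-to-text callable):

```python
OUTLINE_PROMPT = "Write a numbered outline, one line per section, for: {task}"
SECTION_PROMPT = (
    "Outline of the whole document (for consistency; do not restate it):\n"
    "{outline}\n\nWrite the full text of section {i}: {heading}"
)

def outline_then_write(task: str, llm) -> str:
    """Generate a fixed outline, then draft each section against it."""
    outline = llm(OUTLINE_PROMPT.format(task=task))
    headings = [line.strip() for line in outline.splitlines() if line.strip()]
    sections = [
        llm(SECTION_PROMPT.format(outline=outline, i=i, heading=heading))
        for i, heading in enumerate(headings, start=1)
    ]
    return "\n\n".join(sections)
```

Because every section call sees the same short outline, the skeleton stays fixed even when the combined output is far longer than a single pass could keep coherent.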

§ · FURTHER READING · References & deeper sources

  1. Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts · TACL
  2. Bohnet et al. (2024). Long-form factuality in large language models · arXiv
  3. Min et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation · EMNLP
  4. Yang et al. (2022). Re3: Generating Longer Stories With Recursive Reprompting and Revision · EMNLP
  5. Chen et al. (2023). Extending Context Window of Large Language Models via Position Interpolation · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.