Core Concepts · Module 13 · 9 min read

Context Engineering

Prompt engineering asked: how do I phrase the request? Context engineering asks the more important question: what goes into the window in the first place?

The five-bullet version

  • An LLM’s answer is determined by what’s in its context, period. Curating the context is the engineering job.
  • A typical request budgets across system prompt, conversation history, retrieved docs, and a reserve for the answer.
  • More context isn’t always better — relevant tokens compete with noise tokens, and the “middle” of long contexts gets ignored.
  • The big choices: what to retrieve, in what order, in what form (raw vs summarized), with what metadata.
  • Production-grade context is structured — sections with tags, not a blob of prose.

§ 00 · CONTEXT IS THE NEW PROMPT · Why the framing shifted

Prompt engineering used to be a craft of phrasing: find the magic incantation, name the role, append “step by step.” That craft is increasingly automated by stronger models. What replaced it is context engineering, the deliberate curation of everything a model sees at inference time: the system prompt, retrieved documents, conversation summarization, ordering, and metadata.

The mental model: an LLM with a 128k-token context window is a very smart assistant that can read 128k tokens in a few seconds, then answer your question. Your job is to put the right 128k tokens in front of it.

§ 01 · THE TOKEN BUDGET, DRAWN · Where the window actually goes

Every real application breaks the context window into four parts:

Lab · context budget · a 32k-token window, divvied up

  System prompt             800
  Conversation history    4,500
  Retrieved docs          8,000
  Output (reserved)       1,200

  OK · used 14,500 / 32,000 (17,500 tokens of slack)
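To make the arithmetic concrete, here is a minimal sketch of the same budget in Python. The numbers mirror the lab above; in a real system each figure would come from counting tokens with your model's tokenizer, not from constants.

```python
# The lab's 32k budget as plain arithmetic. In a real system each
# figure comes from your tokenizer, not a hard-coded constant.
WINDOW = 32_000

budget = {
    "system_prompt": 800,
    "conversation_history": 4_500,
    "retrieved_docs": 8_000,
    "output_reserved": 1_200,
}

used = sum(budget.values())
assert used <= WINDOW, f"over budget by {used - WINDOW:,} tokens"
print(f"used {used:,} / {WINDOW:,} ({WINDOW - used:,} tokens of slack)")
# -> used 14,500 / 32,000 (17,500 tokens of slack)
```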

§ 02 · WHAT TO PUT IN, WHAT TO LEAVE OUT · Relevance beats volume

The temptation, especially with long-context models, is to put everything in: “the model will figure out what’s relevant.” This consistently produces worse answers than a carefully curated context.

Two reasons: every irrelevant passage competes with the relevant ones for the model’s attention, so noise tokens dilute signal tokens; and the longer the context, the more of it lands in the poorly-attended middle (§ 03). In practice this means filtering candidate chunks for relevance before they enter the window, as sketched below.
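A minimal sketch of such a filter, assuming two hypothetical helpers: `score(chunk)`, returning a relevance float (say, cosine similarity between chunk and query embeddings), and `n_tokens(chunk)`, returning a token count.

```python
def pack_relevant(chunks, score, n_tokens, doc_budget=8_000, min_score=0.35):
    """Keep the highest-scoring chunks that fit the doc budget.

    score     -- hypothetical chunk -> relevance float, e.g. cosine
                 similarity between chunk and query embeddings
    n_tokens  -- hypothetical chunk -> token count
    min_score -- below this, a chunk is noise; drop it even if it fits
    """
    packed, spent = [], 0
    for chunk in sorted(chunks, key=score, reverse=True):
        if score(chunk) < min_score:
            break  # ranked best-first, so everything after is noise too
        cost = n_tokens(chunk)
        if spent + cost <= doc_budget:
            packed.append(chunk)
            spent += cost
    return packed
```

The `break` is the point: an empty slot in the window is better than a noise-filled one.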

§ 03 · PACKING, ORDERING, AND THE MIDDLE PROBLEM · Layout matters

Given you’ve decided what content to include, three sub-choices decide whether the model actually uses it: how much to pack in, where each piece sits in the ordering, and how you work around the poorly-attended middle:

[Figure: retrieval accuracy plotted against the position of the relevant chunk in context (0 = start, 100 = end); a U-shaped curve with primacy and recency peaks at the edges and a trough in the middle]
Fig 1 · The lost-in-the-middle U. Holds across most frontier models, with different magnitudes. Place the chunk that matters most at a position the model is likely to read.
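One mitigation, sketched below under the assumption that your retriever already ranks chunks best-first: lay the chunks out so the strongest sit at the edges of the window and the weakest land in the middle.

```python
from collections import deque

def order_for_edges(ranked):
    """Given chunks sorted best-first, return a layout where the
    strongest chunks bracket the window (primacy and recency) and
    the weakest end up in the middle, where attention is lowest."""
    layout = deque()
    # Walk weakest-to-strongest, alternating ends: each stronger
    # chunk displaces the previous ones toward the center.
    for i, chunk in enumerate(reversed(ranked)):
        if i % 2 == 0:
            layout.append(chunk)       # even steps go to the back
        else:
            layout.appendleft(chunk)   # odd steps go to the front
    return list(layout)

# order_for_edges(["c1", "c2", "c3", "c4", "c5"])
# -> ["c2", "c4", "c5", "c3", "c1"]  (best last, weakest centered)
```

Note the best chunk ends up last, just before the question: the recency sweet spot from Fig 1.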

§ 04 · WHEN TO SUMMARIZE, WHEN TO FETCH · Compressing history vs. replacing it

Two competing strategies for handling long-running sessions: summarize, compressing older turns into a rolling digest that stays in the window; or fetch, dropping older turns from the window entirely and retrieving them on demand when they become relevant again.

The hybrid that works best in production: keep a rolling summary of the conversation up to the last N turns, keep the last N turns verbatim, and retrieve from older history (or external knowledge) on demand.
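A sketch of that hybrid, assuming a hypothetical `summarize(summary, old_turns)` helper (typically a cheap LLM call, not shown) that folds older turns into the running digest.

```python
def compress_history(turns, summary, summarize, keep_last=6):
    """Rolling-summary hybrid: fold everything older than the last
    `keep_last` turns into the running summary, keep the tail verbatim.

    turns     -- turns not yet folded into the summary, oldest first
    summary   -- the rolling summary so far ("" on the first call)
    summarize -- hypothetical (summary, old_turns) -> updated summary
    """
    old, tail = turns[:-keep_last], turns[-keep_last:]
    if old:
        summary = summarize(summary, old)
    # Context layout: summary block first, verbatim tail after it;
    # anything older still lives in external storage for retrieval.
    return summary, tail
```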

CHECK · Your chatbot answers from a 200k-token knowledge base. You're using a 128k-context model. The retrieved chunks fit easily, but the model frequently misses information that's clearly in the retrieved context. Most likely cause?

§ 05 · TAKING THIS FORWARD · What the rest of the stack looks like

Context engineering sits between prompt engineering (phrasing) and RAG (retrieval). RAG, advanced RAG, and agentic patterns are all techniques for generating the context. Context engineering is about deciding what to do with the candidate context once you have it: order it, trim it, structure it, summarize it, decide what gets the budget.

§ · GOING DEEPER · Lost in the middle and what to do about it

Liu et al. (2023) — “Lost in the Middle” — documented the phenomenon: LLMs reliably use information at the start and end of their context, and unreliably use information in the middle. The effect is robust across models, including frontier ones, though it gets weaker as models scale and as training explicitly includes long-context examples.

Three practical responses. Place important context last (just before the question); that's the recency sweet spot. Chunk and retrieve rather than dumping everything in context: a focused 5k-token slice beats a noisy 50k-token slice almost every time. And structure helps (clear headers, XML tags, numbered sections) because it gives the model anchors to attend to. RULER (Hsieh et al. 2024) is the benchmark that tracks how well models actually use their long context windows.
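A sketch of that structured layout, with illustrative tag names; nothing here is a required schema, and the value is in the unambiguous section boundaries, with the question placed last.

```python
def render_context(system, summary, docs, recent_turns, question):
    """Assemble the window as clearly delimited, labeled sections.
    Tag names are illustrative; the point is giving the model
    unambiguous anchors, not this particular vocabulary."""
    doc_blocks = "\n".join(
        f'<doc id="{i}">\n{doc}\n</doc>' for i, doc in enumerate(docs)
    )
    turns = "\n".join(recent_turns)
    return (
        f"<system>\n{system}\n</system>\n"
        f"<conversation_summary>\n{summary}\n</conversation_summary>\n"
        f"<documents>\n{doc_blocks}\n</documents>\n"
        f"<recent_turns>\n{turns}\n</recent_turns>\n"
        f"<question>\n{question}\n</question>"  # last: recency sweet spot
    )
```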

§ · FURTHER READING · References & deeper sources

  1. Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts · TACL
  2. Brown et al. (2020). Language Models are Few-Shot Learners (in-context learning) · NeurIPS
  3. Hsieh et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? · arXiv
  4. Anil et al. (2024). Many-Shot In-Context Learning · arXiv
  5. Anthropic (2024). Prompt engineering for long context · Anthropic Documentation

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.