Context Engineering
Prompt engineering asked: how do I phrase the request? Context engineering asks the more important question: what goes into the window in the first place?
The five-bullet version
- An LLM’s answer is determined by what’s in its context, period. Curating the context is the engineering job.
- A typical request budgets across system prompt, conversation history, retrieved docs, and a reserve for the answer.
- More context isn’t always better — relevant tokens compete with noise tokens, and the “middle” of long contexts gets ignored.
- The big choices: what to retrieve, in what order, in what form (raw vs summarized), with what metadata.
- Production-grade context is structured — sections with tags, not a blob of prose.
§ 00 · CONTEXT IS THE NEW PROMPT
Why the framing shifted
Prompt engineering used to be a craft of phrasing — find the magic incantation, name the role, append “step by step.” That craft is increasingly automated by stronger models. What replaced it is context engineering — the deliberate curation of everything a model sees on a given request: the system prompt, the retrieved documents, the conversation summary, their ordering, and their metadata. It is the successor framing to prompt engineering as models got better at parsing arbitrary phrasings.
The mental model: an LLM with a 128k-token context window is a very smart assistant that can read 128k tokens in a few seconds, then answer your question. Your job is to put the right 128k tokens in front of it.
§ 01 · THE TOKEN BUDGET, DRAWN
Where the window actually goes
Every real application breaks the context window into four parts:
- System prompt. Role, instructions, output schema, guardrails. Stable across requests. Usually 200–2,000 tokens.
- Conversation history. Prior turns in the current session. Grows with each turn.
- Retrieved / injected content. RAG chunks, tool outputs, attached documents. Often the largest segment in production systems.
- Output budget. Tokens reserved for the model’s response. You don’t put anything here, but if the other three parts fill the window, the answer gets truncated or the request fails outright. A budgeting sketch follows this list.
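To make the split concrete, here is a minimal budgeting sketch. The window size, the reserve, and the `count_tokens` heuristic are illustrative assumptions, not any real API’s values — production code would use the model’s actual tokenizer and limits.

```python
# Illustrative numbers; swap in your model's real limits and tokenizer.
WINDOW = 128_000
OUTPUT_RESERVE = 4_000   # room left for the model to answer
SYSTEM = "You are a support assistant. Answer from the provided context only."

def count_tokens(text: str) -> int:
    # Rough chars-per-token heuristic; real apps use the model's tokenizer.
    return max(1, len(text) // 4)

def fit_recent(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns that fit the remaining budget."""
    kept, used = [], 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))

def assemble(history: list[str], retrieved: list[str]) -> str:
    remaining = WINDOW - OUTPUT_RESERVE - count_tokens(SYSTEM)
    docs = "\n".join(retrieved)
    remaining -= count_tokens(docs)         # retrieved content gets priority here
    turns = fit_recent(history, remaining)  # history fills whatever is left
    return "\n\n".join([SYSTEM, docs, *turns])
```

The priority order (retrieval before history) is itself a design choice; some applications invert it when the session matters more than the documents.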
§ 02 · WHAT TO PUT IN, WHAT TO LEAVE OUT
Relevance beats volume
The temptation, especially with long-context models, is to put everything in — “the model will figure out what’s relevant.” This consistently produces worse answers than a carefully curated context.
Two reasons, with a short curation sketch after them:
- Irrelevant tokens compete for attention. Even when the model nominally has 128k tokens of context, the attention operation is the same one as in a 4k window — every position attends to every other position. Adding noise dilutes the signal.
- The middle gets ignored. Empirical work since 2023 repeatedly shows that information in the middle of a long context gets less attention than information at the start or the end. Dropping a critical fact at position 60,000 of a 128,000-token context can make the model behave as if it weren’t there.
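A minimal sketch of that curation step, assuming you already have a relevance score per chunk (embedding similarity, a reranker, BM25 — anything): rank, cut the noise, cap the total. The thresholds here are illustrative.

```python
def curate(chunks: list[str], scores: list[float],
           max_tokens: int = 5_000, min_score: float = 0.3) -> list[str]:
    """Keep the highest-scoring chunks above a relevance floor,
    up to a token cap. Thresholds are illustrative, not tuned."""
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    kept, used = [], 0
    for score, chunk in ranked:
        if score < min_score:
            break                   # below this it's noise, not signal
        cost = len(chunk) // 4      # rough token estimate
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```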
§ 03 · PACKING, ORDERING, AND THE MIDDLE PROBLEM
Layout matters
Once you’ve decided what content to include, three sub-choices determine whether the model actually uses it; a packing sketch follows the list:
- Order. Most-relevant first (or last) when possible. For chronological data, the model usually performs better with recent items closer to the question.
- Structure. Wrap context blocks in clear delimiters (<context>…</context>, headings, or numbered chunks). Structure helps the model parse and cite.
- Metadata. Tag each chunk with its source, date, or trust level. Useful for two things: helping the model resolve conflicts, and producing citations.
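A sketch of what structured packing can look like. The tag and field names (`chunk`, `source`, `date`) are illustrative conventions, not a standard:

```python
def pack(chunks: list[dict]) -> str:
    """Wrap each chunk in a tagged block carrying its metadata,
    so the model can resolve conflicts and cite by id."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        blocks.append(
            f'<chunk id="{i}" source="{c["source"]}" date="{c["date"]}">\n'
            f'{c["text"]}\n</chunk>'
        )
    return "<context>\n" + "\n".join(blocks) + "\n</context>"

# Usage: the system prompt can then say "cite chunk ids in your answer".
packed = pack([
    {"source": "handbook.pdf", "date": "2024-03-01", "text": "Refunds take 5 days."},
    {"source": "faq.md", "date": "2023-11-12", "text": "Refunds take 10 days."},
])
```

The `id` gives the model something to cite; the `date` gives it a way to prefer the newer fact when two chunks conflict, as in the usage example above.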
§ 04 · WHEN TO SUMMARIZE, WHEN TO FETCH
Compressing history vs replacing it
Two competing strategies for handling long-running sessions:
- Summarize. Periodically compress old turns into a summary, drop the originals. Keeps context window manageable; loses fidelity on details the summary skipped.
- Retrieve. Keep originals in an external store; pull back the relevant ones per request via search. No information loss, but adds latency and infrastructure.
The hybrid that works best in production: keep a rolling summary of the conversation up to the last N turns, keep the last N turns verbatim, and retrieve from older history (or external knowledge) on demand.
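A sketch of that hybrid. The `summarize` and `search` callables are assumptions standing in for a cheap summarization call and whatever retrieval you run; only the shape of the strategy is the point.

```python
from typing import Callable

N_VERBATIM = 6  # recent turns kept word-for-word; tune per application

def build_history(turns: list[str], summary: str, query: str,
                  summarize: Callable[[str, list[str]], str],
                  search: Callable[[str], list[str]]) -> tuple[str, str]:
    recent, older = turns[-N_VERBATIM:], turns[:-N_VERBATIM]
    if older:
        # Fold old turns into the rolling summary. (A real system does this
        # incrementally as turns age out, not on every request.)
        summary = summarize(summary, older)
    parts = [f"<summary>{summary}</summary>"] if summary else []
    parts += [f"<retrieved>{r}</retrieved>" for r in search(query)]  # on-demand fetch
    parts += recent                           # verbatim tail, most recent last
    return "\n".join(parts), summary
```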
§ 05 · TAKING THIS FORWARD
What the rest of the stack looks like
Context engineering sits between prompt engineering (phrasing) and RAG (retrieval). RAG, advanced RAG, and agentic patterns are all techniques for generating the context. Context engineering is about deciding what to do with the candidate context once you have it: order it, trim it, structure it, summarize it, decide what gets the budget.
§ · GOING DEEPER
Lost in the middle and what to do about it
Liu et al. (2023) — “Lost in the Middle” — documented the phenomenon: LLMs reliably use information at the start and end of their context, and unreliably use information in the middle. The effect is robust across models, including frontier ones, though it gets weaker as models scale and as training explicitly includes long-context examples.
Three practical responses. Place important context last (just before the question) — that’s the recency sweet spot. Chunk and retrieve rather than dumping everything in context: a focused 5k-token slice beats a noisy 50k-token slice almost every time. And structure helps — clear headers, XML tags, numbered sections — because they give the model anchors to attend to. RULER (Hsieh et al. 2024) is the benchmark that tracks how well models actually use their long context windows.
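One cheap counter-measure, sketched below: sort candidate chunks ascending by relevance so the strongest material lands last, immediately before the question. The scores are assumed to come from whatever reranker you already run.

```python
def order_for_recency(scored: list[tuple[float, str]]) -> list[str]:
    """Ascending by score: the weakest chunks absorb the 'middle',
    the strongest sits right next to the question."""
    return [chunk for _, chunk in sorted(scored, key=lambda p: p[0])]

# Usage: the best chunk ("chunk B") ends up adjacent to the question.
ordered = order_for_recency([(0.2, "chunk A"), (0.9, "chunk B"), (0.5, "chunk C")])
prompt = "\n\n".join(["You answer from context only.", *ordered, "Question: ..."])
```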
§ · FURTHER READING
References & deeper sources
- Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts · TACL
- Brown et al. (2020). Language Models are Few-Shot Learners (in-context learning) · NeurIPS
- Hsieh et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? · arXiv
- Agarwal et al. (2024). Many-Shot In-Context Learning · arXiv
- Anthropic (2024). Prompt engineering for long context · Anthropic Documentation
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.