One-Line Summary: The message history in an agent loop functions as working memory, accumulating context that shapes every subsequent reasoning step and tool invocation.
Prerequisites: ../03-the-reasoning-core/agent-loop-observe-think-act.md, ../01-agent-architecture-foundations/anatomy-of-an-agent.md
## What Is Conversation as Working Memory?
In cognitive science, working memory is the small, active store where a human holds the information they are currently manipulating. For an LLM-based agent, the conversation history serves an almost identical role. Every message — whether it is a user request, an assistant reasoning step, or a tool result — becomes part of the prompt that the model reads on the next iteration. The agent does not have a separate RAM; the message list is the RAM.
This means the quality of an agent's decisions is directly coupled to what sits in its message history. A clean, focused history produces sharp reasoning. A bloated, contradictory history produces confused outputs. Unlike a traditional program where you can inspect variables in a debugger, an agent's "variables" are scattered across natural-language messages that the model must parse anew on every call.
Understanding this relationship is the single most important insight for building reliable multi-step agents. Once you see the conversation as working memory, design decisions about message formatting, tool output length, and summarization strategies all follow naturally.
## How It Works
### The Message List as a Data Structure
Under the hood, every agent framework maintains an ordered list of message objects. In LangGraph and LangChain, these are typically `HumanMessage`, `AIMessage`, and `ToolMessage` instances. Each time the agent loop iterates, the full list is serialized into the LLM prompt. The model reads the entire sequence, generates a response, and that response is appended to the list before the next iteration begins.
```python
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage

# A simplified view of what accumulates during an agent run
messages = [
    HumanMessage(content="Find the top 3 Python web frameworks by GitHub stars"),
    AIMessage(content="", tool_calls=[
        {"name": "web_search", "args": {"query": "..."}, "id": "call_1"}]),
    ToolMessage(content="Results: Flask (67k), Django (79k), FastAPI (75k)...",
                tool_call_id="call_1"),
    AIMessage(content="", tool_calls=[
        {"name": "get_repo_details", "args": {"repo": "django"}, "id": "call_2"}]),
    ToolMessage(content="Django: 79,421 stars, 1,247 contributors...",
                tool_call_id="call_2"),
    AIMessage(content="Based on my research, the top 3 frameworks are..."),
]
```

Each entry carries metadata: a role, optional tool call IDs, and the content itself. The model uses positional information — earlier messages set context, later messages refine it. This is why the order of messages matters enormously.
### How Context Accumulates in the Agent Loop
Every iteration of the observe-think-act loop adds at least one message, and often two (an assistant message with a tool call, then a tool result message). After five tool calls, the history has grown by roughly ten messages. After twenty tool calls, the history can contain fifty or more messages, each consuming tokens from the finite context window.
This accumulation is both a strength and a weakness. The strength is that the agent can refer back to earlier findings without re-fetching them. The weakness is that irrelevant or redundant information competes for the model's attention. Research on LLM behavior shows that models tend to pay more attention to the beginning and end of the context (the "lost in the middle" phenomenon), which means critical information buried in the middle of a long history may be partially ignored.
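To make the accumulation concrete, here is a minimal sketch of a single loop iteration. It assumes a chat model with tools bound to it (`llm_with_tools`) and a mapping from tool names to tools (`TOOLS`); both names are illustrative rather than part of any framework API.

```python
from langchain_core.messages import ToolMessage

def run_one_iteration(messages: list) -> list:
    """One observe-think-act pass: read everything, append what comes back.

    Assumes llm_with_tools (a chat model with .bind_tools(...) applied) and
    TOOLS (a dict mapping tool names to LangChain tools) exist elsewhere.
    """
    ai_msg = llm_with_tools.invoke(messages)   # the model re-reads the ENTIRE history
    messages.append(ai_msg)                    # +1 message
    for call in ai_msg.tool_calls:             # zero or more tool calls
        result = TOOLS[call["name"]].invoke(call["args"])
        messages.append(ToolMessage(content=str(result),
                                    tool_call_id=call["id"]))  # +1 message per call
    return messages
```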
```python
# Token growth over iterations (approximate)
# Iteration 1:  ~200 tokens     (user query + first response)
# Iteration 3:  ~1,200 tokens   (3 tool calls with results)
# Iteration 7:  ~3,500 tokens   (complex multi-step task)
# Iteration 15: ~8,000 tokens   (long research task)
# Iteration 30: ~18,000 tokens  (approaching limits for smaller models)
```

### Tool Results Append as Messages
When a tool executes, its output is appended to the message history as a ToolMessage. This is the primary mechanism by which an agent "remembers" what it has learned. The agent does not store tool results in a database or a variable — it stores them in the conversation. This design has a critical implication: the format and length of tool results directly affect reasoning quality.
A tool that returns a 2,000-token JSON blob consumes ten times the context of a tool that returns a 200-token summary. If the agent calls that tool five times, the difference is 10,000 tokens versus 1,000 tokens — potentially the difference between a successful run and a context overflow.
```python
# Poor: raw tool output consumes excessive working memory
ToolMessage(content='{"results": [{"title": "...", "url": "...", "snippet": "...", '
                    '"date": "...", "source": "...", "metadata": {...}}, ...]}',
            tool_call_id="call_1")

# Better: structured, concise tool output preserves working memory
ToolMessage(content="Top 3 results:\n"
                    "1. Django (79k stars) - Full-stack framework\n"
                    "2. FastAPI (75k stars) - Async API framework\n"
                    "3. Flask (67k stars) - Lightweight micro-framework",
            tool_call_id="call_1")
```

### Strategies for Keeping the Conversation Focused
Unchecked accumulation turns the message history into noise. Several strategies help keep the conversation sharp and useful for reasoning.
**Strategy 1: Summarize tool results before appending them**
```python
def summarize_tool_result(raw_result: str, tool_name: str) -> str:
    """Condense large tool outputs before they enter working memory."""
    # llm_fast is assumed to be a small, inexpensive chat model instance
    # (any LangChain chat model will do).
    if len(raw_result) > 2000:
        summary = llm_fast.invoke(
            f"Summarize this {tool_name} result in under 200 words, "
            f"keeping all key data points:\n\n{raw_result}"
        )
        return f"[Summarized] {summary.content}"
    return raw_result
```
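Wired into the loop, the summarizer sits between tool execution and the append. A sketch under assumed names (`web_search` as the tool, `call_id` as the id of the pending tool call):

```python
# Illustrative wiring: condense the raw output before it enters the history.
raw = web_search.invoke({"query": "top Python web frameworks by stars"})
messages.append(ToolMessage(
    content=summarize_tool_result(raw, "web_search"),
    tool_call_id=call_id,  # id of the tool call being answered
))
```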
**Strategy 2: Insert checkpoint messages after completing sub-tasks**

```python
# After completing a sub-task, inject a summary marker
messages.append(AIMessage(content=(
    "[CHECKPOINT] Sub-task complete: Retrieved Q4 revenue data. "
    "Key finding: Widget Pro leads at $524K. "
    "Raw data above can be skimmed — this summary captures the essential result."
)))
```

**Strategy 3: Instruct the agent to stay focused in the system prompt**
```python
FOCUS_INSTRUCTION = """
Stay focused on the user's original request. If you discover related
issues while working, note them but do not investigate unless they
block the current task. Side quests waste context space and dilute
your working memory.
"""
```
**Strategy 4: Limit what enters working memory in the first place**

```python
# BAD: Appending a 10,000-line log file as a tool result
messages.append(ToolMessage(content=full_log_file,
                            tool_call_id=call_id))  # 50,000 tokens!

# GOOD: Appending only the relevant lines
messages.append(ToolMessage(content=(
    f"[Log file: 10,247 lines. Showing 12 relevant error lines]\n"
    f"{relevant_lines}"
), tool_call_id=call_id))
```

## Why It Matters
### The Conversation IS the Agent's Brain
Unlike traditional software where memory is explicitly managed through variables, databases, and caches, an LLM agent's memory is implicitly managed through the conversation. If information is not in the conversation, the agent does not know it. If the conversation is cluttered, the agent's reasoning is impaired. This is the foundation for understanding every other memory-related concept in agent design.
### Conversation Length Directly Affects Reasoning Quality
Empirical testing shows a clear relationship between conversation length and output quality. Short, focused conversations under 4,000 tokens tend to produce precise, well-reasoned responses. As conversations grow beyond 8,000-10,000 tokens, models begin to exhibit drift — repeating earlier steps, contradicting prior conclusions, or losing track of the original goal. Beyond 20,000 tokens, even frontier models show measurable degradation in multi-step reasoning accuracy. This is not a theoretical concern; any research or coding agent that performs more than a handful of tool calls will routinely cross these thresholds.
## Key Technical Details
- The message list is the only working memory an LLM agent has by default — there are no hidden registers or caches
- Each agent loop iteration typically adds 2 messages: one `AIMessage` (with tool calls) and one `ToolMessage` (with results)
- Token consumption per tool call varies widely: simple lookups produce 50-200 tokens, web searches produce 300-800 tokens, file reads can produce 1,000-5,000 tokens
- The "lost in the middle" effect means information at positions 40-60% through the context receives the least model attention
- LangGraph's default message handling appends all messages; no pruning occurs unless explicitly configured
- Tool result formatting is the single highest-leverage optimization — a 10x reduction in tool output size translates directly to 10x more room for reasoning
- Conversation history of 5 tool calls at 500 tokens each = 2,500 tokens of working memory consumed before any reasoning tokens are counted (a rough accounting sketch follows this list)
- Models with 128k context windows still show reasoning degradation beyond approximately 30k tokens of conversation history
- The system prompt (1,000-3,000 tokens) is a fixed cost present in every call, reducing the space available for conversation
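A rough accounting helper can make these budgets visible at run time. The sketch below is hand-rolled and uses a crude four-characters-per-token heuristic rather than a real tokenizer; substitute an actual token counter for accurate numbers.

```python
def estimate_working_memory(messages, system_prompt: str, chars_per_token: int = 4) -> dict:
    """Approximate token accounting for a message list (~4 characters per token)."""
    history_tokens = sum(len(str(m.content)) for m in messages) // chars_per_token
    system_tokens = len(system_prompt) // chars_per_token  # fixed cost on every call
    return {
        "system_prompt_tokens": system_tokens,
        "history_tokens": history_tokens,
        "total_tokens": system_tokens + history_tokens,
    }

# Example: a 2,000-token system prompt plus five 500-token tool results already
# consumes ~4,500 tokens before any new reasoning happens.
```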
## Common Misconceptions
"The agent remembers things between calls like a program remembers variables": The agent has no persistent memory between LLM calls. It re-reads the entire conversation from scratch on every iteration. What looks like "remembering" is actually re-processing the full message history each time.
"Longer context windows solve the working memory problem": Larger context windows help, but they do not eliminate the issue. Reasoning quality degrades with conversation length regardless of the window size. A 128k-token window does not mean 128k tokens of effective working memory — practical effective memory is much smaller.
"Tool results are stored separately and retrieved on demand": In standard agent architectures, tool results are inline in the message history. There is no lazy loading or on-demand retrieval. Every tool result is re-read by the model on every subsequent iteration, consuming tokens whether or not the information is still relevant.
## Connections to Other Concepts
- `structured-state-management.md` — When conversation-as-memory becomes insufficient, explicit state structures provide an alternative
- `context-window-pressure.md` — The direct consequence of unbounded conversation growth
- `../03-the-reasoning-core/agent-loop-observe-think-act.md` — The loop that generates and consumes conversation history
- `persistent-memory-across-sessions.md` — Extending memory beyond a single conversation
## Further Reading
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023) — Empirical study of attention patterns across context positions
- Sumers et al., "Cognitive Architectures for Language Agents" (2023) — Formal framework mapping cognitive science concepts to LLM agent design
- Anthropic, "Building Effective Agents" (2024) — Design patterns for conversation management in production agents
- LangChain Documentation, "Message History and Memory" (2024) — Practical guide to managing message lists in LangChain