One-Line Summary: The "messages" stream mode delivers LLM output token-by-token as (message_chunk, metadata) tuples, enabling responsive real-time chat interfaces.
Prerequisites: stream-modes.md, ../01-langgraph-foundations/what-is-langgraph.md, ../03-building-your-first-agent/prebuilt-react-agent.md
What Is Token Streaming?
Picture a news ticker scrolling across the bottom of a TV screen. Each word appears the moment it is available rather than waiting for the entire headline to be written. Token streaming works the same way -- instead of waiting for the LLM to finish its entire response, you receive each token (roughly a word or word fragment) the instant the model produces it. This turns a multi-second wait into a fluid, typewriter-like experience.
Without token streaming, a user staring at a chat interface sees nothing until the model finishes -- which can take 5 to 30 seconds for long responses. With token streaming, the first token appears in under a second, and the response builds in real time. This perceived speed is what makes modern chat applications feel responsive.
In LangGraph, token streaming is activated by using the "messages" stream mode. Every chunk the graph yields is a tuple of a message fragment and metadata describing where it came from.
How It Works
Basic Token Streaming
```python
from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", streaming=True)
agent = create_react_agent(llm, tools=[])

for msg, metadata in agent.stream(
    {"messages": [("user", "Explain quantum computing")]},
    stream_mode="messages",
):
    if msg.content:
        print(msg.content, end="", flush=True)
```
Each `msg` is an `AIMessageChunk` object containing a small piece of the response. The `end=""` and `flush=True` arguments ensure tokens print on the same line without buffering.
Understanding Metadata
The metadata dict tells you which node generated the chunk, allowing you to filter or route output.
```python
for msg, metadata in agent.stream(input, stream_mode="messages"):
    node_name = metadata.get("langgraph_node", "unknown")
    # Only stream tokens from the main agent node
    if node_name == "agent" and msg.content:
        print(msg.content, end="", flush=True)
    elif node_name == "tools":
        pass  # Suppress tool execution output
```
Filtering by Node in Multi-Node Graphs
In complex graphs with multiple LLM-calling nodes, you often want to stream from only one.
```python
STREAM_NODES = {"writer", "summarizer"}

for msg, metadata in graph.stream(input, stream_mode="messages"):
    if metadata.get("langgraph_node") in STREAM_NODES and msg.content:
        yield msg.content
```
Collecting the Full Response
You can accumulate chunks into a complete message while streaming.
```python
full_response = ""
for msg, metadata in agent.stream(input, stream_mode="messages"):
    if msg.content and metadata.get("langgraph_node") == "agent":
        full_response += msg.content
        print(msg.content, end="", flush=True)

print(f"\n\nFull response length: {len(full_response)}")
```
Tool Call Chunks
When the LLM decides to call a tool, the stream emits chunks with tool_call_chunks instead of content.
```python
for msg, metadata in agent.stream(input, stream_mode="messages"):
    if msg.content:
        print(msg.content, end="", flush=True)
    elif msg.tool_call_chunks:
        for tc in msg.tool_call_chunks:
            print(f"\n[Calling tool: {tc.get('name', '...')}]")
```
Why It Matters
- Perceived latency drops dramatically -- Users see the first token in under a second even if the full response takes 15 seconds.
- Chat UIs feel natural -- Token-by-token rendering matches the experience users expect from modern AI assistants.
- Node-level filtering -- Metadata lets you stream selectively, showing only the final answer and hiding internal reasoning.
- Progressive rendering -- Frontends can parse partial markdown, render lists, and format code blocks as tokens arrive.
Key Technical Details
- Each chunk is a `(BaseMessageChunk, dict)` tuple -- typically `AIMessageChunk` for LLM responses.
- The `metadata` dict always contains `langgraph_node` identifying the source node.
- The LLM must support streaming; most modern providers (OpenAI, Anthropic, Google) do by default.
- Tool call decisions stream as `tool_call_chunks` on the message object, not as `content`.
- Empty `content` strings are common between tool calls -- always check `if msg.content` before rendering.
- Token streaming works with both `graph.stream()` (sync) and `graph.astream()` (async).
- The `"messages"` mode only emits during LLM inference; non-LLM nodes are silent.
- Message chunks include `id` and `response_metadata` for tracking and billing.
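The sync/async point above can be illustrated without a live model. In the sketch below, `fake_astream` is a hypothetical stand-in for `agent.astream(...)` -- an async generator yielding `(message_chunk, metadata)` tuples the way `"messages"` mode does -- but the consuming loop is exactly the pattern you would use in a non-blocking application:

```python
import asyncio
from types import SimpleNamespace

# Hypothetical stand-in for agent.astream(..., stream_mode="messages"):
# an async generator of (message_chunk, metadata) tuples. Real chunks
# are AIMessageChunk objects; SimpleNamespace keeps the sketch runnable
# without an LLM call.
async def fake_astream():
    for token in ["Quantum ", "computing ", "uses ", "qubits."]:
        yield SimpleNamespace(content=token), {"langgraph_node": "agent"}

async def collect_tokens(stream):
    """Consume a token stream without blocking the event loop."""
    parts = []
    async for msg, metadata in stream:
        if msg.content and metadata.get("langgraph_node") == "agent":
            parts.append(msg.content)
    return "".join(parts)

result = asyncio.run(collect_tokens(fake_astream()))
print(result)  # Quantum computing uses qubits.
```

The same `async for` loop works unchanged against a real `agent.astream(input, stream_mode="messages")` call.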
Common Misconceptions
- "Token streaming requires special LLM configuration." Most LangChain chat model wrappers stream by default. You rarely need to set `streaming=True` explicitly.
- "Every chunk contains meaningful text." Many chunks have empty `content` -- especially during tool call generation. Always guard with `if msg.content`.
- "Token streaming shows all graph activity." It only shows LLM output. For full graph visibility, use `"events"` or `"updates"` mode instead.
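The empty-content guard can be exercised with simulated chunks. Here `SimpleNamespace` objects stand in for real `AIMessageChunk` instances (an assumption made so the sketch runs without an LLM); note how the tool-call chunks carry empty `content` and partial argument fragments:

```python
from types import SimpleNamespace

# Simulated "messages"-mode chunks: tool-call chunks have empty content,
# so rendering code must guard with `if msg.content`.
chunks = [
    SimpleNamespace(content="", tool_call_chunks=[{"name": "search", "args": '{"q": '}]),
    SimpleNamespace(content="", tool_call_chunks=[{"name": None, "args": '"llm"}'}]),
    SimpleNamespace(content="The answer ", tool_call_chunks=[]),
    SimpleNamespace(content="is 42.", tool_call_chunks=[]),
]

rendered = []
tool_args = ""
for msg in chunks:
    if msg.content:                 # render only chunks with visible text
        rendered.append(msg.content)
    elif msg.tool_call_chunks:      # accumulate partial tool-call arguments
        tool_args += "".join(tc["args"] for tc in msg.tool_call_chunks)

print("".join(rendered))  # The answer is 42.
print(tool_args)          # {"q": "llm"}
```

Without the `if msg.content` guard, the two empty strings would still be harmless here, but in a real UI they often arrive as `None`-like placeholders or trigger needless re-renders.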
Connections to Other Concepts
- `stream-modes.md` -- Overview of all four stream modes including messages mode.
- `async-streaming.md` -- Async token streaming with `astream()` for non-blocking applications.
- `streaming-in-production.md` -- Forwarding token streams over SSE and WebSocket connections.
- `../02-tools-and-models/binding-tools-to-models.md` -- How tool binding affects what appears in the token stream.
- `../03-building-your-first-agent/prebuilt-react-agent.md` -- The prebuilt agent used in examples above.
- `../09-deployment/fastapi-deployment.md` -- Integrating token streaming into a FastAPI endpoint.
Further Reading
- LangGraph Streaming Messages How-To -- Official guide to token streaming.
- LangChain Streaming Documentation -- How LangChain chat models handle streaming under the hood.
- OpenAI Streaming API Reference -- The upstream protocol that messages mode wraps.