One-Line Summary: Raft is a distributed-consensus protocol that elects a leader from a peer group and serializes all decisions through that leader, with a clean recovery story when the leader fails. Applied to agent systems, it gives a peer group a way to agree on shared state (a plan, a memory entry, a verdict) without trusting any single agent permanently.
Prerequisites: Consensus in multi-agent systems
What Is Raft?
Raft (Ongaro & Ousterhout, 2014) was designed as an understandable alternative to Paxos. The protocol's mechanics:
- Leader election: One peer is elected leader by majority vote, with terms that increase monotonically.
- Log replication: All decisions go through the leader, who appends them to a log and replicates to followers.
- Commit: An entry is committed once a majority of peers have replicated it.
- Failure recovery: If the leader is unreachable, peers elect a new leader at a higher term.
The guarantees: committed state never diverges between peers, even while peers crash or messages are lost (safety), and the system keeps making progress as long as a majority of peers are reachable (liveness).
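The election step can be sketched in a few lines. This is a minimal in-memory illustration, not a usable implementation: the `Peer` class, its field names, and `run_election` are hypothetical, and networking, timeouts, and persistence are all left out. It shows the vote rule (one vote per term, and only for a candidate whose log is at least as up to date) and the majority check.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Peer:
    """Hypothetical in-memory peer; a real one persists term and vote to disk."""
    peer_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)

    def last_log_position(self) -> Tuple[int, int]:
        """Return (term, index) of the last log entry, or (0, 0) if empty."""
        if not self.log:
            return (0, 0)
        return (self.log[-1][0], len(self.log))

    def handle_request_vote(self, term: int, candidate_id: str,
                            last_log_term: int, last_log_index: int) -> bool:
        # Reject candidates from an older term.
        if term < self.current_term:
            return False
        # A higher term resets this peer's vote for the new term.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # The candidate's log must be at least as up to date as ours:
        # later last term wins; with equal terms, the longer log wins.
        up_to_date = (last_log_term, last_log_index) >= self.last_log_position()
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate: Peer, others: List[Peer]) -> bool:
    """Candidate bumps its term, votes for itself, and asks every other peer."""
    candidate.current_term += 1
    candidate.voted_for = candidate.peer_id
    last_term, last_index = candidate.last_log_position()
    votes = 1  # the candidate's own vote
    for peer in others:
        if peer.handle_request_vote(candidate.current_term, candidate.peer_id,
                                    last_term, last_index):
            votes += 1
    cluster_size = len(others) + 1
    return votes > cluster_size // 2  # strict majority
```

Calling `run_election(a, [b, c])` on three fresh peers makes `a` the leader for term 1; a second candidate asking for votes in the same term is refused because each peer has already voted.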
Raft for Agents
In a multi-agent system, Raft applies when peers need to agree on something that must be consistent: a final answer, a shared plan, a memory write, a permission grant. Without consensus, two agents can produce contradictory updates that overwrite each other.
The Raft peers are usually coordinator processes (one per swarm member), not the LLM-driven agents themselves. The LLM agents propose; the coordinators run Raft.
A typical use case: a federated swarm has 5 agents from different organizations. They produce 5 review verdicts on a PR. Rather than trusting one to aggregate, the coordinators run Raft to commit a single agreed verdict to a shared log.
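A rough sketch of that verdict flow follows. Everything here is an assumption made for illustration: the `Follower` and `LeaderCoordinator` classes, the `aggregate` rule (a simple plurality of the reviews), and the verdict strings are hypothetical, and a real deployment would use a Raft library rather than this toy replication loop. The point is only that the agreed verdict counts as committed once a majority of coordinators have stored it.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Follower:
    """Hypothetical follower coordinator; `reachable` simulates a crash or partition."""
    name: str
    reachable: bool = True
    log: List[str] = field(default_factory=list)

    def append(self, entry: str) -> bool:
        if not self.reachable:
            return False  # no ack from an unreachable peer
        self.log.append(entry)
        return True


@dataclass
class LeaderCoordinator:
    followers: List[Follower]
    log: List[str] = field(default_factory=list)
    committed: List[str] = field(default_factory=list)

    def propose(self, entry: str) -> bool:
        """Append locally, replicate, and commit only on a majority of acks."""
        self.log.append(entry)
        acks = 1 + sum(f.append(entry) for f in self.followers)  # leader counts itself
        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:
            self.committed.append(entry)
            return True
        # Real Raft keeps retrying; here the entry simply stays uncommitted.
        return False


def aggregate(verdicts: Dict[str, str]) -> str:
    """Assumed aggregation rule: the most common verdict wins."""
    return Counter(verdicts.values()).most_common(1)[0][0]
```

With five coordinators, the leader can still commit when two followers are unreachable (3 of 5 acks), but not when three are.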
Why It Matters
For multi-agent systems where state coherence matters — shared memory, plan agreement, verdict logging — Raft is the well-understood, well-implemented option. Most teams shouldn't reimplement it; they should adopt a library (etcd's Raft, hashicorp/raft, or ruflo's wrapper).
Raft is not a fit for every multi-agent scenario. If peers are non-adversarial and one has clear authority, a queen-led setup backed by a database is simpler. Raft pays off when leadership is contested or peers fail unpredictably.
Key Technical Details
- Quorum is majority: For 2f+1 peers, the system tolerates f failures. 3 peers tolerate 1 failure; 5 tolerate 2.
- Latency is bounded by majority replication: Each commit waits for a majority of acks. Geographic distribution adds RTT.
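- Leader election costs time: Each election starts a new term, and nothing new can be committed until a candidate wins. Election timeouts are tunable and randomized; overly aggressive timeouts cause unnecessary elections.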
- Log compaction: Logs grow without bound; periodic snapshotting is needed.
- Read consistency options: Linearizable reads go through the leader; eventually consistent (possibly stale) reads can hit any peer.
- Raft is for crash failures, not Byzantine ones: A peer that returns wrong data isn't handled by Raft. Use Byzantine protocols for that case (next concept).
- Membership changes are subtle: Adding or removing peers goes through joint consensus (or one-server-at-a-time changes in the simpler variant). Raft handles it, but the procedure is non-trivial.
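The quorum arithmetic above can be checked with a short sketch. The function names are hypothetical; `commit_index` assumes the leader tracks, for every peer including itself, the highest log index known to be replicated there, and it ignores Raft's extra rule that the leader only commits entries from its own current term.

```python
from typing import List


def quorum_size(n_peers: int) -> int:
    """Smallest number of peers that forms a majority."""
    return n_peers // 2 + 1


def tolerated_failures(n_peers: int) -> int:
    """How many peers can crash while a majority stays reachable."""
    return (n_peers - 1) // 2


def commit_index(match_index: List[int]) -> int:
    """Highest log index stored on a majority of peers.

    match_index holds one entry per peer (leader included): the highest
    log index known to be replicated on that peer.
    """
    ordered = sorted(match_index, reverse=True)
    # The quorum-th highest value is present on at least a quorum of peers.
    return ordered[quorum_size(len(ordered)) - 1]
```

For example, `tolerated_failures(5)` is 2, and with replication progress `[7, 7, 5, 4, 2]` across five peers the commit index is 5, because only indexes up to 5 are stored on three peers. An even-sized cluster gains nothing: 4 peers need a quorum of 3 and still tolerate only 1 failure.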
How Harnesses & Frameworks Implement This
| Harness / Framework | Raft support |
|---|---|
| Claude Code | None natively |
| Claude Agent SDK | DIY |
| ruflo | First-class — selectable consensus per swarm; ruflo-federation |
| LangGraph | DIY — model state-graph as Raft log |
| AutoGen | DIY |
| CrewAI | DIY |
| OpenAI Agents SDK | DIY |
| Codex CLI / Cursor | None |
Connections to Other Concepts
- consensus-in-multi-agent-systems.md: Parent concept.
- byzantine-fault-tolerant-agents.md, gossip-protocols-for-agents.md: Alternative protocols.
- cross-machine-agent-federation.md: The setting where Raft becomes necessary.
Further Reading
- Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" (2014).
- raft.github.io — Visualizations and reference implementations.