Raft for Agents

One-Line Summary: Raft is a distributed-consensus protocol that elects a leader from a peer group and serializes all decisions through that leader, with a clean recovery story when the leader fails — applied to agent systems, Raft gives a peer group a way to agree on shared state (a plan, a memory entry, a verdict) without trusting any single agent permanently.

Prerequisites: Consensus in multi-agent systems

What Is Raft?

Raft (Ongaro & Ousterhout, 2014) was designed as an understandable alternative to Paxos. The protocol's mechanics:

Leader election: One peer is elected leader by majority vote, with terms that increase monotonically.
Log replication: All decisions go through the leader, who appends them to a log and replicates to followers.
Commit: An entry is committed once a majority of peers have replicated it.
Failure recovery: If the leader is unreachable, peers elect a new leader at a higher term.
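The election step above can be sketched as a vote handler. This is a minimal illustration rather than any library's API: `PeerState` and `handle_request_vote` are hypothetical names, and the field names simply follow the Raft paper. A peer rejects candidates from older terms, adopts any higher term it sees, and grants at most one vote per term, and only to a candidate whose log is at least as up to date as its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PeerState:
    """Hypothetical per-coordinator state; field names follow the Raft paper."""
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)

    def last_log(self) -> Tuple[int, int]:
        """Return (last_term, last_index); an empty log is (0, 0)."""
        if not self.log:
            return (0, 0)
        return (self.log[-1][0], len(self.log))


def handle_request_vote(state: PeerState, candidate_id: str, term: int,
                        last_log_term: int, last_log_index: int) -> Tuple[int, bool]:
    """Decide whether to grant a vote; returns (current_term, vote_granted)."""
    # A candidate from an older term is rejected outright.
    if term < state.current_term:
        return (state.current_term, False)
    # Seeing a higher term moves this peer into that term and clears its vote.
    if term > state.current_term:
        state.current_term = term
        state.voted_for = None
    # Grant at most one vote per term, and only to a candidate whose log is
    # at least as up to date as ours (compare last term, then last index).
    log_ok = (last_log_term, last_log_index) >= state.last_log()
    if state.voted_for in (None, candidate_id) and log_ok:
        state.voted_for = candidate_id
        return (state.current_term, True)
    return (state.current_term, False)
```

A candidate that collects granted votes from a majority of peers, counting its own, becomes leader for that term. The sketch omits timers, persistence, and networking, all of which a real implementation needs.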

The guarantees: under crash failures and network partitions, committed state is never inconsistent, and the system keeps making progress as long as a majority of peers can reach one another.

In a multi-agent system, Raft applies when peers need to agree on something that must be consistent: a final answer, a shared plan, a memory write, a permission grant. Without consensus, two agents can produce contradictory updates that overwrite each other.

The peer "agents" in the Raft sense are usually coordinator processes (one per swarm member) rather than the LLM-driven agents themselves. The LLM agents propose; the coordinators run Raft.

A typical use case: a federated swarm has 5 agents from different organizations. They produce 5 review verdicts on a PR. Rather than trusting one to aggregate, the coordinators run Raft to commit a single agreed verdict to a shared log.
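The five-coordinator case can be sketched as an in-memory simulation. Everything here is illustrative: `Follower`, `replicate_and_commit`, and the `verdict:...` entry strings are hypothetical, and the sketch assumes a leader has already been elected. It shows only the commit rule: the leader appends the verdict, sends it to the followers, and treats it as committed once a majority of the cluster (leader included) holds it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Follower:
    """Hypothetical follower coordinator with an in-memory log."""
    name: str
    reachable: bool = True
    log: List[str] = field(default_factory=list)

    def append(self, entry: str) -> bool:
        """Store the entry and acknowledge, unless this peer is unreachable."""
        if not self.reachable:
            return False
        self.log.append(entry)
        return True


def replicate_and_commit(leader_log: List[str], followers: List[Follower],
                         entry: str) -> bool:
    """Append entry on the leader, replicate it, and report whether it committed."""
    leader_log.append(entry)
    acks = 1  # the leader's own copy counts toward the majority
    for follower in followers:
        if follower.append(entry):
            acks += 1
    cluster_size = len(followers) + 1
    return acks > cluster_size // 2


# Five coordinators: one leader plus four followers, two of them unreachable.
followers = [
    Follower("org-b"),
    Follower("org-c"),
    Follower("org-d", reachable=False),
    Follower("org-e", reachable=False),
]
leader_log: List[str] = []
committed = replicate_and_commit(leader_log, followers, "verdict:approve")
```

With two of five peers down, three copies exist and the verdict commits; with three down, it would not. In real Raft an uncommitted entry stays in the leader's log and is retried or later overwritten by a new leader, which this sketch does not model.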

Why It Matters

For multi-agent systems where state coherence matters — shared memory, plan agreement, verdict logging — Raft is the well-understood, well-implemented option. Most teams shouldn't reimplement it; they should adopt a library (etcd's Raft, hashicorp/raft, or ruflo's wrapper).

Raft is not a fit for every multi-agent scenario. If peers are non-adversarial and one has clear authority, a queen-led topology backed by a database is simpler. Raft pays off when leadership is contested or peers fail unpredictably.

Key Technical Details

Quorum is majority: For 2f+1 peers, the system tolerates f failures. 3 peers tolerate 1 failure; 5 tolerate 2.
Latency is bounded by majority replication: Each commit waits for a majority of acks. Geographic distribution adds RTT.
Leader election costs time: Each election starts a new term, and the group accepts no new writes until a leader wins. Election timeouts are tunable; overly aggressive timeouts cause unnecessary elections.
Log compaction: Logs grow without bound; periodic snapshotting is needed.
Read consistency options: Linearizable reads go through the leader, which must first confirm it still leads; eventually consistent reads can hit any peer but may return stale data.
Raft is for crash failures, not Byzantine ones: A peer that returns wrong data isn't handled by Raft. Use Byzantine protocols for that case (next concept).
Membership changes are subtle: Adding/removing peers needs joint consensus — Raft handles it but the procedure is non-trivial.
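The quorum and latency points above reduce to a little arithmetic, sketched below. The function names are hypothetical, and `commit_latency_ms` rests on simplifying assumptions: the leader sends to all followers in parallel, its own write is instantaneous, and each follower's acknowledgement arrives after one round trip.

```python
from typing import List


def quorum_size(peers: int) -> int:
    """Smallest number of peers that forms a majority."""
    return peers // 2 + 1


def tolerated_failures(peers: int) -> int:
    """Largest f such that the remaining peers still form a majority."""
    return (peers - 1) // 2


def commit_latency_ms(follower_rtts_ms: List[float]) -> float:
    """Time until a majority holds the entry, under the assumptions above."""
    peers = len(follower_rtts_ms) + 1  # followers plus the leader
    needed_follower_acks = quorum_size(peers) - 1  # leader already has a copy
    if needed_follower_acks == 0:
        return 0.0
    # The commit completes when the k-th fastest follower acknowledges.
    return sorted(follower_rtts_ms)[needed_follower_acks - 1]


# Leader with four followers: two nearby, two on other continents.
latency = commit_latency_ms([5.0, 80.0, 120.0, 200.0])
```

For five peers the leader needs two follower acknowledgements, so the commit waits on the second-fastest follower (80 ms here) and the slowest peers do not affect it. An even peer count raises the quorum without raising fault tolerance: four peers still tolerate only one failure.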

How Harnesses & Frameworks Implement This

Harness / Framework	Raft support
Claude Code	None natively
Claude Agent SDK	DIY
ruflo	First-class — selectable consensus per swarm; `ruflo-federation`
LangGraph	DIY — model state-graph as Raft log
AutoGen	DIY
CrewAI	DIY
OpenAI Agents SDK	DIY
Codex CLI / Cursor	✗

Connections to Other Concepts

consensus-in-multi-agent-systems.md — Parent concept.
byzantine-fault-tolerant-agents.md, gossip-protocols-for-agents.md — Alternative protocols.
cross-machine-agent-federation.md — The setting where Raft becomes necessary.
