Byzantine Fault-Tolerant Agents

One-Line Summary: Byzantine fault-tolerant (BFT) protocols handle the case where peers may not just fail but actively misbehave — returning wrong data, breaking the protocol, colluding — with the cost of needing 3f+1 peers to tolerate f bad ones; for federated agent systems with peers from untrusted parties, BFT is the right correctness model.

Prerequisites: Raft for agents, consensus in multi-agent systems, cross-machine agent federation

What Is Byzantine Fault Tolerance?

The "Byzantine generals problem" (Lamport, Shostak, Pease, 1982) asks: how do n generals coordinate an attack when some of them may be traitors who send conflicting or false messages? With unsigned ("oral") messages, the answer requires at least 3f+1 generals to tolerate f traitors. Protocols that reach agreement despite this kind of fault are collectively called Byzantine fault-tolerant (BFT).
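The algorithm behind that bound is the paper's recursive oral-messages algorithm OM(m), which is proven correct whenever there are more than 3m generals and at most m traitors. Below is a minimal Python sketch of it. The traitor strategy (send ATTACK to even-numbered generals and RETREAT to odd ones) and the helper names are illustrative assumptions. The paper allows traitors to behave arbitrarily.

```python
from collections import Counter

ATTACK, RETREAT = "ATTACK", "RETREAT"

def majority(values):
    """Strict-majority vote; falls back to RETREAT when no value has more than half."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else RETREAT

def send(sender, receiver, value, traitors):
    """Loyal generals relay faithfully; traitors use a hypothetical vote-splitting strategy."""
    if sender in traitors:
        return ATTACK if receiver % 2 == 0 else RETREAT
    return value

def om(m, commander, lieutenants, value, traitors):
    """Oral-messages algorithm OM(m); returns each lieutenant's decided value."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {l: send(commander, l, value, traitors) for l in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant j acts as commander in OM(m-1) toward the others.
    relayed = {
        j: om(m - 1, j, [l for l in lieutenants if l != j], received[j], traitors)
        for j in lieutenants
    }
    # Step 3: lieutenant i votes over its direct value and what each other lieutenant relayed.
    return {
        i: majority([received[i]] + [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }

if __name__ == "__main__":
    # n = 4, one traitorous lieutenant: loyal lieutenants 1 and 2 follow the commander.
    print(om(1, 0, [1, 2, 3], ATTACK, traitors={3}))
    # n = 4, traitorous commander: the lieutenants still agree with one another.
    print(om(1, 0, [1, 2, 3], ATTACK, traitors={0}))
    # n = 3, one traitorous lieutenant: loyal lieutenant 1 cannot follow the loyal commander.
    print(om(1, 0, [1, 2], ATTACK, traitors={2}))
```

In the three-general case, lieutenant 1 sees one ATTACK and one RETREAT, finds no strict majority, and falls back to RETREAT even though the loyal commander ordered ATTACK. That is the failure the 3f+1 bound rules out.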

For agent systems, "traitors" are peers that may be:

Compromised (a plugin from an untrusted source).
Buggy (returning malformed or wrong responses without raising an error).
Adversarial (a federated peer from a competing organization).
Subject to prompt injection (yielding to attacker control).

BFT protocols ensure the system reaches correct consensus despite such peers, as long as at most f of the 3f+1 total peers are faulty (equivalently, fewer than one third of them).
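The 3f+1 figure follows from two quorum conditions. For safety, any two quorums must share at least f+1 peers, so at least one honest peer sits in both. For liveness, the honest peers alone must be able to form a quorum, because faulty peers may never answer. The Python sketch below derives the bound from those conditions rather than assuming it; the function names are illustrative, not from any library.

```python
from itertools import combinations

def min_quorum(n: int, f: int) -> int:
    """Smallest quorum q such that any two quorums share at least f + 1 peers.

    Two q-sized subsets of n peers overlap in at least 2q - n peers, so we need
    2q - n >= f + 1, i.e. q = ceil((n + f + 1) / 2).
    """
    return (n + f + 2) // 2

def tolerates(n: int, f: int) -> bool:
    """Safe (overlap above) and live: the n - f honest peers can form a quorum alone."""
    return min_quorum(n, f) <= n - f

def max_faulty(n: int) -> int:
    """Largest f that n peers can tolerate; works out to (n - 1) // 3."""
    return max(f for f in range(n) if tolerates(n, f))

def smallest_overlap(n: int, q: int) -> int:
    """Brute-force check: minimum intersection between any two q-subsets of n peers."""
    quorums = [set(c) for c in combinations(range(n), q)]
    return min(len(a & b) for a in quorums for b in quorums)

if __name__ == "__main__":
    for n in (3, 4, 7, 10):
        f = max_faulty(n)
        q = min_quorum(n, f)
        print(f"n={n}: tolerates f={f}, quorum={q}, worst overlap={smallest_overlap(n, q)}")
```

For n = 4, 7, and 10 this gives f = 1, 2, 3 with quorums of 3, 5, 7, i.e. 2f+1. With n = 3 and f = 1, the required quorum (3) exceeds the number of honest peers (2), so the system cannot make progress.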

How It Works

Practical Byzantine Fault Tolerance (PBFT; Castro and Liskov, 1999) is the most-cited reference protocol. Its normal case, simplified:

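Client → primary: A client sends a request to the current primary; in view v, the primary is replica v mod n.
Pre-prepare: The primary assigns the request a sequence number and broadcasts it to all backups.
Prepare: Each backup that accepts the pre-prepare broadcasts a prepare message to all other replicas.
Commit: Once a replica holds the pre-prepare plus 2f matching prepares (2f+1 agreeing replicas in total), it is "prepared" and broadcasts a commit.
Reply: Once a replica has received 2f+1 matching commits (including its own), it executes the request and replies to the client, which accepts a result only after f+1 replicas return the same answer.
View change: If backups suspect the primary is faulty (for example, a request times out), they move to view v+1, whose primary is replica (v+1) mod n.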
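The quorum checks in that flow can be traced in a short single-process simulation. This is a sketch under stated assumptions, not a PBFT implementation: it handles one sequence number, has no network or timers, delivers every message to every replica (so all honest replicas see the same counts), assumes an honest primary, and models faulty replicas as always sending the placeholder value "forged". The `executed:` result string stands in for a hypothetical state-machine operation.

```python
import hashlib
from collections import Counter

def pbft_slot(request, n=4, view=0, byzantine=frozenset({3})):
    """Push one request through PBFT's normal case; return what the client accepts, or None."""
    f = (n - 1) // 3
    primary = view % n                       # PBFT: the primary of view v is v mod n
    if primary in byzantine:
        raise ValueError("faulty primary: PBFT would trigger a view change instead")
    digest = hashlib.sha256(request.encode()).hexdigest()

    def send(sender, value):
        # Hypothetical fault model: faulty replicas always send a wrong value.
        return "forged" if sender in byzantine else value

    # Pre-prepare: the primary binds the request digest to this view/sequence slot.
    pre_prepare = send(primary, digest)

    # Prepare: each backup broadcasts a PREPARE echoing the pre-prepared digest.
    prepares = Counter(send(b, pre_prepare) for b in range(n) if b != primary)
    # Prepared = pre-prepare + 2f matching PREPAREs (2f + 1 replicas agree).
    if prepares[digest] < 2 * f:
        return None

    # Commit: every replica broadcasts COMMIT; 2f + 1 matching ones commit the slot.
    commits = Counter(send(r, digest) for r in range(n))
    if commits[digest] < 2 * f + 1:
        return None

    # Reply: each replica executes the request and answers the client directly.
    replies = Counter(send(r, f"executed:{request}") for r in range(n))
    result, votes = replies.most_common(1)[0]
    # The client trusts a result only once f + 1 replicas agree (one must be honest).
    return result if votes >= f + 1 else None

if __name__ == "__main__":
    print(pbft_slot("transfer"))                                      # n=4, one faulty backup
    print(pbft_slot("transfer", n=7, byzantine=frozenset({5, 6})))    # n=7, f=2
    print(pbft_slot("transfer", byzantine=frozenset({2, 3})))         # 2 faulty of 4: stalls
```

With two faulty replicas out of four, only one matching prepare arrives, so the slot never becomes prepared and the call returns None. Exceeding f costs progress here rather than producing a wrong answer.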

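The cost is communication: the prepare and commit phases are all-to-all, so PBFT sends O(n²) messages per request. HotStuff reduces this to O(n) per decision by having replicas send votes only to the leader, which combines them into a single threshold-signed quorum certificate; its chained variant also pipelines phases across consecutive proposals. Tendermint, another widely deployed BFT protocol, still gossips votes all-to-all.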
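A rough message count makes the difference concrete. The sketch below counts replica-to-replica messages for one request in PBFT's normal case (client request and replies excluded) and compares it with a generic leader-centric pattern in the style of HotStuff. The `phases=3` default is an assumption for illustration, not HotStuff's exact message schedule.

```python
def pbft_normal_case_messages(n: int) -> int:
    """Messages among replicas for one request in PBFT's normal case."""
    pre_prepare = n - 1              # primary -> every backup
    prepare = (n - 1) * (n - 1)      # every backup -> every other replica
    commit = n * (n - 1)             # every replica -> every other replica
    return pre_prepare + prepare + commit

def leader_star_messages(n: int, phases: int = 3) -> int:
    """Leader-centric pattern: per phase the leader broadcasts to n - 1 replicas and
    collects n - 1 votes, which it aggregates into one quorum certificate.
    `phases` is an assumed parameter."""
    return phases * 2 * (n - 1)

if __name__ == "__main__":
    for n in (4, 7, 31, 100):
        print(n, pbft_normal_case_messages(n), leader_star_messages(n))
```

At n = 4 the gap is small (24 vs 18 messages). At n = 100 it is 19,800 vs 594, which is why linear protocols matter far more for large validator sets than for a four-peer agent swarm.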

Why It Matters for Agents

Federated agent platforms (ruflo's federation mode, multi-organization agent collaborations, agent marketplaces with installable third-party agents) must assume that some peers are Byzantine. A plugin from an unknown developer might be malicious; a federated peer might be compromised; a plugin updated yesterday might have introduced a bug.

BFT is the protocol-level answer to "can we trust this swarm's consensus?" Under a crash-fault protocol such as Raft, a single bad peer (for example, a compromised leader) can silently corrupt outputs, manipulate memory writes, or subvert decisions.

Key Technical Details

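3f+1 minimum: Tolerating f Byzantine peers requires at least 3f+1 peers with quorums of 2f+1, so that any two quorums overlap in at least f+1 peers and therefore in at least one honest one. Tolerating 1 needs 4 peers; tolerating 2 needs 7.
BFT is more expensive than Raft: More peers, more messages, and higher latency. Use it only when the threat model justifies it.
Cryptographic signatures are required: Every message must be signed so receivers can verify its origin and a faulty peer cannot forge another peer's votes. Signatures also let a peer forward a quorum of votes as proof, which view changes rely on. (See mtls-and-ed25519-for-agent-trust.md.)
View changes have liveness implications: Aggressive view-change triggers cause thrashing (healthy primaries replaced over transient slowness); conservative ones leave faulty primaries in place longer.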
Hybrid setups are common: BFT for inter-organization communication; Raft within each organization. Cheaper inside, robust at the boundary.
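Throughput is rarely the bottleneck: BFT throughput depends on cluster size, network latency, batching, and signature costs, but agent swarms make comparatively few decisions per second, and each decision is typically dominated by model-call latency rather than consensus overhead.
Protocol choice depends on the adversary model: Crash faults only call for Raft. Byzantine faults under asynchrony call for HoneyBadgerBFT-style protocols. Byzantine faults under partial synchrony call for PBFT or HotStuff.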
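One way to damp view-change thrashing, sketched loosely after PBFT: a replica doubles its timeout each time the timer expires without progress, and it joins a view change early only after f+1 distinct peers have asked for a higher view, so faulty peers alone cannot drag it along. Resetting the timeout after progress is an assumption of this sketch, as are the class and method names; message signing and the NEW-VIEW exchange are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class ViewChangePolicy:
    n: int
    base_timeout: float                 # seconds; hypothetical tuning knob
    max_timeout: float
    view: int = 0
    timeout: float = field(init=False, default=0.0)
    f: int = field(init=False, default=0)
    votes: dict = field(default_factory=dict)   # peer id -> view that peer asked for

    def __post_init__(self) -> None:
        self.f = (self.n - 1) // 3
        self.timeout = self.base_timeout

    @property
    def primary(self) -> int:
        return self.view % self.n       # PBFT rule: primary of view v is v mod n

    def on_timer_expired(self) -> int:
        """No progress within `timeout`: move to the next view and double the timer."""
        self._enter(self.view + 1)
        self.timeout = min(2 * self.timeout, self.max_timeout)
        return self.primary

    def on_view_change_vote(self, peer: int, target_view: int) -> bool:
        """Record a peer's VIEW-CHANGE; join once f + 1 distinct peers want a higher view."""
        if target_view > self.view:
            self.votes[peer] = target_view
        higher = [v for v in self.votes.values() if v > self.view]
        if len(higher) >= self.f + 1:   # at least one honest peer is among them
            self._enter(min(higher))
            return True
        return False

    def on_progress(self) -> None:
        """Assumption: a committed request means the primary is healthy; reset backoff."""
        self.timeout = self.base_timeout

    def _enter(self, new_view: int) -> None:
        self.view = new_view
        # Keep only votes that still point past the new view.
        self.votes = {p: v for p, v in self.votes.items() if v > new_view}

if __name__ == "__main__":
    p = ViewChangePolicy(n=4, base_timeout=1.0, max_timeout=8.0)
    p.on_view_change_vote(3, 1)         # one peer (possibly the faulty one): ignored
    p.on_view_change_vote(2, 1)         # f + 1 = 2 peers: join view 1
    print(p.view, p.primary, p.timeout)
```

The f+1 threshold guards against a lone faulty peer forcing churn, while the capped exponential backoff keeps repeated timeouts from replacing primaries faster than a slow but honest one can respond.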

How Harnesses & Frameworks Implement This

Harness / Framework	BFT support
Claude Code	None
Claude Agent SDK	DIY
ruflo	First-class — selectable per swarm in federation mode
LangGraph	DIY
AutoGen	DIY
CrewAI	DIY
OpenAI Agents SDK	DIY
Codex CLI / Cursor	None

Connections to Other Concepts

raft-for-agents.md — The crash-fault-tolerant counterpart.
cross-machine-agent-federation.md — The natural setting.
behavioral-trust-scoring.md — Reputation layer above protocol consensus.
mtls-and-ed25519-for-agent-trust.md — The signature substrate.