Course · 8 modules · 70 lessons · 221 min

Agent Harnesses & Orchestration

The harness layer above LLMs — Claude Agent SDK, Codex CLI, Cursor, ruflo, LangGraph, AutoGen, CrewAI, and OpenAI Agents SDK compared concept-by-concept. Topologies, consensus, federation, planning, and the orchestration plumbing that turns models into systems.

The Harness Layer
· Claude Agent SDK Overview (3 min): The Claude Agent SDK is Anthropic's official toolkit for building harnesses (or harness-shaped applications) on top of Claude — it is the SDK that Claude Code itself is built on, exposing primitives for agent loops, tools, hooks, sub-agents, and MCP.
· Claude Code as Harness (4 min): Claude Code is Anthropic's official terminal harness — a CLI that wraps Claude with a programmable loop, hooks, sub-agents, slash commands, skills, MCP servers, and permission scoping, used in this course as the reference harness for examples and exercises.
· Codex CLI and Cursor as Harnesses (3 min): Codex CLI is OpenAI's terminal coding harness — the OpenAI counterpart to Claude Code — while Cursor is the dominant IDE-coding harness; together they bracket the design space of single-developer agentic coding tools.
· Harness vs. Framework vs. SDK (5 min): A *harness* is a deployed product that runs models for you (Claude Code, Cursor); a *framework* is a library you compose into your own application (LangGraph, AutoGen); an *SDK* is the toolkit for building either (Claude Agent SDK, OpenAI Agents SDK) — conflating them is the single most common error in 2026 agent infrastructure conversations.
· Harness vs. Orchestration Framework (3 min): Within the *harness* category there is a useful sub-distinction between *single-agent harnesses* (Claude Code, Codex CLI, Cursor) and *orchestration frameworks* / *orchestration platforms* (ruflo, OpenHands, AutoGPT-X) — the latter add multi-agent topology, swarms, federation, and autonomous loops on top of the harness loop.
· Ruflo Architecture Tour (3 min): Ruflo (formerly claude-flow) is the most-adopted open-source multi-agent orchestration platform of 2026; it layers on top of Claude Code with 100+ specialized agents, 314 MCP tools, 27 hooks, 32 plugins, queen-led/mesh/adaptive topologies, AgentDB+ReasoningBank memory, federated zero-trust execution, and a SONA-based learning loop.
· The 2026 Harness Landscape (3 min): As of mid-2026 the agent-harness market has split into roughly four categories — coding-IDE harnesses, terminal coding harnesses, orchestration platforms, and headless/agentic-OS harnesses — each represented by 2–4 dominant products with overlapping but distinct positioning.
· What Is an AI Harness? (11 min): An AI harness is the orchestration layer that wraps a language model with the loop, tools, memory, permissions, and lifecycle hooks needed to turn raw model outputs into a working agentic system — it is what you actually deploy, not the model itself.
· Why the Harness Is the Product (3 min): As frontier models commoditize within a benchmark point of each other, the harness — not the model — is what users adopt, customize, get locked into, and pay for; the harness layer captures most of the durable value in the agent economy.
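The core loop that "What Is an AI Harness?" describes, prompt the model, dispatch its tool calls, feed results back, stop on an answer or a budget, can be sketched in a few lines. Everything here (`fake_model`, the `read_file` tool, the message shape) is a hypothetical stand-in for illustration, not any real SDK's API:

```python
def fake_model(messages):
    # Stand-in for an LLM call: asks for one tool, then answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "read_file", "args": {"path": "README"}}
    return {"answer": "done"}

TOOLS = {"read_file": lambda path: f"contents of {path}"}

def harness_loop(user_prompt, model=fake_model, max_steps=5):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):            # the harness, not the model, enforces a budget
        reply = model(messages)
        if "answer" in reply:             # model signals completion
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])   # tool dispatch
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted")
```

The point of the sketch is the division of labor: the model only proposes; the loop, the tool registry, and the termination rule live in the harness.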
Harness Primitives
· Agent Definitions and Personas (4 min): An *agent definition* is the file that declares a sub-agent's identity (system prompt + tools + model + termination) and makes it reusable across sessions; the *persona* is the part of that file that captures voice, role, and decision-making style — together they turn ad-hoc role prompts into versioned, composable artifacts.
· Hooks and Lifecycle Events (3 min): Hooks are user-defined scripts that fire on harness lifecycle events — before a tool runs, after it returns, when a session starts, when the agent stops — letting you add policy, logging, validation, or transformation without forking the harness.
· MCP as the Universal Tool Bus (4 min): The Model Context Protocol (MCP) is the cross-harness tool standard — a single MCP server runs identically inside Claude Code, Cursor, ruflo, Codex CLI, Zed, and Continue, which is why the same `github` or `postgres` server installation works everywhere and why MCP, not any harness's native tool format, became the lingua franca of harness extensions.
· Permission and Tool Scoping Primitives (3 min): A harness's permission system — which tools a given agent can use, when they require user confirmation, and which paths/commands are off-limits — is enforced at the harness layer (not the model layer) and is the most important security primitive for any agentic deployment.
· Plugin and Marketplace Systems (4 min): A harness plugin is a packaged directory of extensions (sub-agents, hooks, slash commands, skills, MCP servers) that can be installed into a harness as a unit; a marketplace is the discovery layer that turns plugins into a distributed ecosystem — ruflo's marketplace and Claude Code's plugin system are the reference implementations in 2026.
· Settings and Configuration Files (3 min): A harness's configuration files (`settings.json`, `CLAUDE.md`, `.cursorrules`, `.ruflo/config.toml`) are its public API — the user-editable contract through which extensions, permissions, hooks, and memory are declared; their format and merge semantics matter as much as any code in the harness.
· Skills vs. Tools (3 min): Tools are individual callable functions the model invokes by name (`read_file`, `run_tests`); skills are higher-level capabilities the model opts into mid-conversation that bundle a system prompt, instructions, and a curated set of tools — the skill is the unit a model decides to *adopt*; the tool is the unit it *calls*.
· Slash Commands (3 min): Slash commands are user-typed shortcuts (`/review`, `/test`, `/explain`) that inject a parameterized prompt or invoke a workflow inside an active harness session — they are the keyboard-first surface for harness-extension UX, sitting alongside hooks (system-driven) and tools (model-driven).
· Sub-Agents as Primitives (3 min): A *sub-agent* is a full agent — its own context window, system prompt, scoped tool registry, and termination condition — that the main agent can spawn for a specific task; sub-agents differ from "role prompts" precisely because of that isolation, and treating them as the same is a common source of multi-agent bugs.
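Two of the primitives above, hooks and permission scoping, combine naturally: a pre-tool hook that vetoes out-of-scope tool calls before the harness executes them. A minimal sketch with made-up names (`pre_tool_hook`, `dispatch`, the blocked-path list), not any harness's actual hook API:

```python
BLOCKED_PATHS = ("/etc", "/root")   # illustrative permission scope

def pre_tool_hook(tool_name, args):
    """Fires before every tool call; returns (allowed, reason)."""
    path = args.get("path", "")
    if tool_name == "write_file" and any(path.startswith(p) for p in BLOCKED_PATHS):
        return False, f"write to {path} is outside the permitted scope"
    return True, "ok"

def dispatch(tool_name, args, tools):
    # The harness, not the model, runs the hook and enforces its verdict.
    allowed, reason = pre_tool_hook(tool_name, args)
    if not allowed:
        raise PermissionError(reason)
    return tools[tool_name](**args)
```

Because the check runs in the harness layer, a prompt-injected model request to write `/etc/passwd` never reaches the tool at all.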
Topologies And Coordination
· Adaptive Topology Switching (3 min): An adaptive topology switches between queen-led, mesh, hive-mind, and other shapes at runtime based on workload signals (task complexity, agent count, latency, cost) — the most sophisticated coordination pattern, exemplified by ruflo's adaptive mode, with significant complexity cost.
· Conversational Orchestration (3 min): Conversational orchestration — the AutoGen pattern — coordinates multiple agents through a multi-turn dialogue rather than dispatch-and-return: agents take turns speaking in a shared transcript, with a moderator deciding who goes next, and agreement emerges from the conversation itself.
· Hive Mind Pattern (3 min): A hive mind is a topology where many simple agents share a common memory store and produce emergent behavior that no individual agent encodes — closer to swarm intelligence than to a structured organization, useful when the problem benefits from many independent partial solutions that combine.
· Mesh Topology (3 min): A mesh topology lets every agent talk to every other agent directly, with no central coordinator — useful when peers genuinely need to negotiate, but expensive in tokens and hard to debug, so it is rarely the right default.
· Queen-Led Hierarchy (3 min): A queen-led topology has a single high-authority "queen" agent that allocates tasks to a pool of workers, arbitrates conflicts, and decides when work is done — ruflo's flagship topology and the most token-efficient way to coordinate 5+ agents on a complex task.
· Role-Based Orchestration (3 min): Role-based orchestration — popularized by CrewAI — assigns work by *role* (researcher, writer, editor) rather than by topology shape, with each role's persona, tools, and termination condition baked into a reusable definition; the topology emerges from how the roles are wired together.
· Supervisor Pattern Deep Dive (3 min): The supervisor pattern is the framework-vocabulary cousin of queen-led: one supervisor agent routes tasks to specialist agents and gathers results — it is the strong default recommended by Anthropic's "Building Effective Agents," and the topology you should pick when in doubt.
· Topology as a Design Decision (3 min): The shape of how agents connect — single, supervisor-led, mesh, hive mind, queen-led, or adaptive — is a deliberate design decision with concrete cost, latency, and reliability consequences, not an emergent property of running multiple agents.
· Topology Selection Decision Tree (3 min): A practical decision tree for picking a topology: start with single-agent, escalate to supervisor / queen-led only when single-agent demonstrably falls short, escalate beyond that only for specific patterns (mesh for negotiation, hive-mind for exploration, conversational for discussion, federation for cross-trust).
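The supervisor pattern, the recommended default above, reduces to a small dispatch-and-return loop. A toy sketch in which a hard-coded `route` function and two lambda workers stand in for the supervisor's usually model-driven routing and for real specialist sub-agents:

```python
WORKERS = {
    "code": lambda task: f"patch for: {task}",   # stand-in specialist agents
    "docs": lambda task: f"docs for: {task}",
}

def route(task):
    # Stand-in for the supervisor's routing decision (normally a model call).
    return "docs" if "explain" in task else "code"

def supervisor(tasks):
    results = {}
    for task in tasks:
        results[task] = WORKERS[route(task)](task)   # dispatch-and-return, no peer chatter
    return results
```

Note the contrast with mesh or conversational topologies: workers never talk to each other, which is exactly what keeps the pattern cheap and debuggable.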
Planning And Replanning
· A* Planner for Agents (3 min): A* is the classical heuristic search algorithm at the heart of GOAP and most structured agent planners — it finds the lowest-cost action sequence from current state to goal state by expanding nodes in order of *cost-so-far + estimated-cost-to-goal*, and it is the workhorse of any harness that does plan-shaped (rather than chain-of-thought-shaped) planning.
· Adaptive Replanning (3 min): Adaptive replanning is the discipline of detecting when the current plan no longer fits reality (a tool failed, a precondition was violated, a result surprised the agent) and rebuilding a new plan from the post-divergence state — every long-horizon agent system needs it; the question is how the harness expresses it.
· Goal-Oriented Action Planning (GOAP) (3 min): GOAP is a planning technique borrowed from game AI where the agent searches a graph of available actions for a sequence whose preconditions and effects connect the current world state to a goal — used in modern harnesses as a structured alternative to free-form chain-of-thought planning.
· Multi-Step Plan Evaluation (3 min): Evaluating an agent's *plan* — separately from evaluating its execution — lets you detect bad plans before they burn tokens, and lets you compare planning strategies; the harness usually exposes evaluation as a hook between plan generation and execution.
· Plan-Driven vs. Reactive Harnesses (3 min): Plan-driven harnesses (ruflo, LangGraph) build a structured plan upfront and execute against it; reactive harnesses (Cursor, Codex CLI in default mode) decide each next step based on what just happened — both are valid; the choice is mostly about task horizon and the cost of upfront planning.
· Plan Graphs vs. Plan Strings (3 min): A *plan string* is what an LLM emits when you ask it to "plan first" — a numbered list embedded in chain-of-thought; a *plan graph* is a structured, typed representation of the plan that the harness can inspect, verify, and replay — graphs are dramatically more reliable for non-trivial tasks, at the cost of more upfront engineering.
· Plan Rollback and Checkpointing (3 min): Rollback is the harness's ability to undo actions taken by an agent (file edits, branch creates, tool side effects) when a plan fails or is replanned; checkpointing is the snapshotting that makes rollback possible — together they are the difference between a recoverable agent and a destructive one.
· Speculative Planning and Branching (3 min): Speculative planning explores multiple candidate plans in parallel — picking the best one only after partial execution — at higher token cost in exchange for lower wall-clock latency and better outcomes on hard tasks; closer to chess-engine search than to typical LLM planning.
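The A*/GOAP planning described above can be made concrete: search a graph of actions with preconditions and effects, expanding nodes in order of cost-so-far plus estimated-cost-to-goal. The three actions below are a toy example, not drawn from any real harness:

```python
import heapq
from itertools import count

ACTIONS = [
    # (name, cost, preconditions, effects) over sets of world facts
    ("read_code",   1, set(),             {"code_read"}),
    ("write_tests", 2, {"code_read"},     {"tests_written"}),
    ("run_tests",   1, {"tests_written"}, {"tests_pass"}),
]

def plan(start, goal):
    h = lambda s: len(goal - s)          # admissible heuristic: missing goal facts
    tie = count()                        # tiebreaker so the heap never compares sets
    frontier = [(h(start), 0, next(tie), frozenset(start), [])]
    seen = set()
    while frontier:
        _, g, _, state, steps = heapq.heappop(frontier)
        if goal <= state:
            return steps                 # cheapest action sequence reaching the goal
        if state in seen:
            continue
        seen.add(state)
        for name, cost, pre, eff in ACTIONS:
            if pre <= state:             # action applicable in this state
                nxt = state | eff
                heapq.heappush(frontier,
                    (g + cost + h(nxt), g + cost, next(tie), nxt, steps + [name]))
    return None                          # goal unreachable from start
```

The same search skeleton underlies plan graphs generally: because the plan is typed data rather than a numbered list in chain-of-thought, the harness can verify preconditions before executing and replan from any intermediate state.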
Memory And Learning
· AgentDB and Vector Stores in Harnesses (3 min): AgentDB (ruflo's purpose-built vector database) and vector stores generally are the harness's substrate for semantic recall — embeddings of past trajectories, code snippets, documents, and decisions are kept in a queryable index so the agent can retrieve relevant memories on demand.
· Cross-Session Memory Strategies (3 min): Cross-session memory strategies decide what an agent remembers between conversations — the durable artifacts (configuration files, summaries, trajectories, adapters) and the policies for writing, retrieving, and aging them; this is one of the highest-leverage UX dimensions of any harness.
· Harness-Owned Memory (2 min): Durable agent memory — across turns, sessions, machines, and users — is owned by the harness, not the model; this is one of the harness's load-bearing responsibilities and a major axis on which harnesses differentiate.
· HNSW for Agent Recall (3 min): HNSW (Hierarchical Navigable Small World) is the dominant approximate-nearest-neighbor index used by agent vector stores — it is the data structure underneath AgentDB, Pinecone, Qdrant, Weaviate, and most production memory layers, and understanding its trade-offs explains a lot about why agent recall feels the way it does.
· Memory Portability Across Harnesses (3 min): Memory portability — whether the artifacts you've built up in one harness work in another — is partial in 2026: configuration files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`) are convergent enough to copy-with-edits; vector stores and trajectory stores are mostly per-harness; adapters are model-specific; expect a portability gradient, not a clean abstraction.
· Micro-LoRA Adapters at the Harness Layer (3 min): Micro-LoRA adapters are small, project-scoped low-rank fine-tunes (typically <50MB) that the harness can load on top of a base model to bias it toward the project's conventions, vocabulary, and successful trajectories — emerging in 2026 as a way to give agents a kind of parametric memory without the cost of full fine-tuning.
· ReasoningBank (3 min): ReasoningBank is ruflo's named pattern for storing whole *trajectories* — the sequence of (state, decision, outcome) tuples an agent produced — as memory the system can replay or learn from; it is a specialized vector store optimized for trajectory-shaped data and a key driver of ruflo's claimed self-learning behavior.
· SONA: Self-Learning Neural Patterns (3 min): SONA is ruflo's pattern-matching layer that learns *which strategies* tend to succeed for *which task signatures*, sitting one level above ReasoningBank — instead of replaying trajectories verbatim, SONA distills them into reusable patterns that bias the agent's planner toward known-good moves.
· Trajectory Learning (3 min): Trajectory learning is the family of techniques that learn from full agent rollouts (state-action-outcome sequences) rather than from isolated examples — it includes simple replay (store trajectories, retrieve at run time) and stronger forms (parametric updates via fine-tuning or LoRA on successful trajectories).
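The semantic-recall substrate these lessons describe boils down to nearest-neighbor search over embeddings. A toy sketch using hand-picked 3-dimensional vectors and brute-force cosine similarity; real memory layers use learned embeddings with hundreds of dimensions and an approximate index such as HNSW rather than an exhaustive scan:

```python
import math

MEMORIES = {
    # memory text -> toy embedding (real systems use a learned embedding model)
    "use pytest for this repo":          [0.9, 0.1, 0.0],
    "deploys run via GitHub Actions":    [0.1, 0.9, 0.0],
    "secrets live in the vault service": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def recall(query_vec, k=1):
    # Rank all stored memories by similarity to the query embedding.
    ranked = sorted(MEMORIES, key=lambda m: cosine(query_vec, MEMORIES[m]), reverse=True)
    return ranked[:k]
```

Everything above HNSW in the stack, ReasoningBank's trajectory recall included, is a specialization of this retrieve-by-similarity loop to differently shaped records.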
Consensus And Federation
· Behavioral Trust Scoring (3 min): Behavioral trust scoring assigns each federated peer a reputation score that updates based on observed behavior (latency, accuracy, protocol compliance, malicious actions detected) — and uses that score to gate privileges; cryptographic identity proves *who*, behavioral trust proves *whether they should be allowed*.
· Byzantine Fault-Tolerant Agents (3 min): Byzantine fault-tolerant (BFT) protocols handle the case where peers may not just fail but actively misbehave — returning wrong data, breaking the protocol, colluding — with the cost of needing 3f+1 peers to tolerate f bad ones; for federated agent systems with peers from untrusted parties, BFT is the right correctness model.
· Consensus in Multi-Agent Systems (2 min): Consensus protocols — Raft, Byzantine, gossip — are how multiple agents agree on state, decisions, or outputs in the presence of disagreement, latency, or untrusted peers, and they are increasingly first-class primitives in modern multi-agent harnesses.
· Cross-Machine Agent Federation (3 min): Federation lets agents on different machines (and sometimes different organizations) collaborate on tasks while preserving each side's privacy, trust assumptions, and resource budgets — exemplified by ruflo's federation mode, which combines mTLS for transport, ed25519 for identity, gossip for membership, and Raft/BFT for shared decisions.
· Gossip Protocols for Agents (3 min): Gossip protocols spread information probabilistically — each peer periodically picks a few random peers and exchanges state with them, converging the cluster toward a shared view over time without any leader; for large agent populations where eventual consistency is acceptable, gossip is the right scaling strategy.
· mTLS and ed25519 for Agent Trust (3 min): Mutual TLS (both sides authenticate via certificates) and ed25519 message signatures (compact, fast, modern) are the cryptographic substrate of federated agent systems — they are how a remote agent proves "I am who I say I am" before any meaningful interaction begins.
· PII Gating and AIDefence (3 min): PII gating is the harness-layer scrubbing of personally identifiable information (and secrets, credentials, sensitive metadata) from data flowing across trust boundaries; ruflo's `AIDefence` plugin is the reference implementation, identifying 14+ classes of sensitive data and either redacting, blocking, or alerting based on configured policy.
· Prompt Injection Defense in Harnesses (3 min): Prompt injection — adversarial text embedded in retrieved content, tool outputs, files, or messages that hijacks the agent's behavior — is defended at the harness layer through a defense-in-depth stack: input sanitization, content provenance tracking, tool permission scoping, hook-based blocking, and behavioral monitoring.
· Raft for Agents (3 min): Raft is a distributed-consensus protocol that elects a leader from a peer group and serializes all decisions through that leader, with a clean recovery story when the leader fails — applied to agent systems, Raft gives a peer group a way to agree on shared state (a plan, a memory entry, a verdict) without trusting any single agent permanently.
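A gossip round, as described above, is short enough to sketch: each peer exchanges state with a few random peers, and repeated rounds converge the cluster toward a shared view without any leader. The version-numbered dict used as state here is an illustrative stand-in for real membership or memory data:

```python
import random

def merge(a, b):
    # Last-writer-wins per key, by version number.
    out = dict(a)
    for key, (ver, val) in b.items():
        if key not in out or out[key][0] < ver:
            out[key] = (ver, val)
    return out

def gossip_round(peers, fanout=2, rng=random):
    # Every peer exchanges state with `fanout` random other peers (push-pull).
    for i in range(len(peers)):
        others = [p in range(0) or p for p in range(len(peers)) if p != i]
        for j in rng.sample(others, fanout):
            merged = merge(peers[i], peers[j])
            peers[i] = merged
            peers[j] = dict(merged)   # both sides leave the exchange in sync
```

Each round the set of peers holding a given fact roughly multiplies, which is why gossip scales to large agent populations when eventual (rather than immediate) consistency is acceptable.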
Background Workers And Autopilots
· Audit and Optimize Workers (3 min): Audit workers continuously inspect recent agent activity (commits, edits, decisions) for regressions, anti-patterns, or risks; optimize workers proactively rewrite code, prompts, or configurations toward measured improvements — together they form the most-cited concrete example of the background-worker pattern.
· Autopilot Modes (3 min): Autopilot modes let the harness run an agent without per-action user confirmation — bounded by a budget (tokens, time, steps), gated by permission scopes, monitored by background workers, and ended by an explicit checkpoint where the user reviews — they are the UX surface that makes long-horizon agentic work practical.
· The Background Worker Pattern (2 min): Background workers are agents the harness runs *between* user turns — auditing recent changes, optimizing code, looking for test gaps, refreshing memory — without requiring the user to ask, and they are one of the most important emerging patterns in 2026 harnesses.
· Continuous Execution Loops (3 min): A continuous-execution loop runs an agent indefinitely against a stream of tasks, events, or goals — distinct from a "session" that has a start and end — and is the runtime model that supports background workers, autopilot, federated agents, and always-on agentic services.
· Event-Driven Harness Architectures (3 min): An event-driven harness reacts to events — file changes, GitHub webhooks, build completions, schedule triggers — by invoking the agent loop without a user typing anything; this architecture turns a user-driven harness into an autonomous service and is the substrate for background workers, autopilot, and federated coordination.
· Methodology as Plugin: ADR and DDD (3 min): Architecture Decision Records (ADR) and Domain-Driven Design (DDD) are the two most-cited "discipline" methodologies in software engineering; ruflo packages each as a plugin (`ruflo-adr`, `ruflo-ddd`) so the discipline becomes a slash command rather than a team practice.
· Methodology as Plugin: SPARC (3 min): SPARC (Specification, Pseudocode, Architecture, Refinement, Code) is an agent-driven engineering methodology packaged as a ruflo plugin (`ruflo-sparc`) — it is the cleanest example of how a software methodology can be encoded as a multi-step agent workflow, not just adopted as a habit.
· Testgap and Coverage Workers (3 min): A testgap worker continuously identifies code without test coverage and proposes (or generates) tests; coverage workers track what is and isn't covered, surface deltas after each session, and prevent slow erosion of test quality — among the highest-leverage background workers because the work they do is something humans skip under time pressure.
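The event-driven worker pattern above can be sketched as a queue drain: events arrive from file watchers, webhooks, or schedulers, and a budget-bounded worker invokes an agent per event with no user in the loop. `audit_agent`, the handler table, and the event shapes are hypothetical placeholders, not a real harness API:

```python
import queue

def audit_agent(event):
    # Stand-in for a real agent invocation (e.g. "inspect this change").
    return f"audited {event['payload']}"

HANDLERS = {"file_changed": audit_agent, "webhook": audit_agent}

def run_worker(events, budget=10):
    # Drain up to `budget` events; the budget plays the role of an autopilot's step bound.
    results = []
    while budget > 0 and not events.empty():
        ev = events.get_nowait()
        results.append(HANDLERS[ev["type"]](ev))
        budget -= 1
    return results
```

Swap the bounded drain for an infinite blocking loop and the same skeleton becomes the continuous-execution runtime the lessons above describe.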
Harness Economics And Comparison
· Choosing Your Harness Stack (4 min): The capstone decision: pick a harness (interactive surface), decide whether you need an orchestration platform on top (multi-agent / autopilot), pick an SDK if you're building rather than using, and lean on MCP and configuration files to keep the choice reversible — most of the cost of getting it wrong is portability cost, which is partly mitigable.
· Claude Code vs. Codex CLI vs. Cursor (3 min): Side-by-side comparison of the three dominant single-developer coding harnesses in 2026 — Claude Code (terminal-first, hooks-rich, sub-agent-capable), Codex CLI (terminal-first OpenAI counterpart, simpler primitives), Cursor (IDE-tight, agent-mode autopilot, IDE-shaped extensibility).
· Harness Cost Models (2 min): A harness's cost is dominated not by per-token model price but by how often it calls the model, how aggressively it caches the prefix, when it falls back to a cheaper model, and how many sub-agents it parallelizes — these are harness-level decisions, not model-level ones.
· LangGraph vs. AutoGen vs. CrewAI (3 min): Side-by-side comparison of the three dominant agent frameworks in 2026 — LangGraph (graph-based, explicit state, production-leaning), AutoGen (conversational multi-agent, dialogue-centric), CrewAI (role-based, opinionated, approachable) — each shines for different problem shapes and team backgrounds.
· Model Routing in Harnesses (3 min): Model routing is the harness-layer decision of which model handles which turn — a small fast model for routing/classification, a large smart model for hard reasoning, a code-tuned model for coding subtasks; routing is the second-largest cost lever after caching, and a major source of harness differentiation.
· OpenAI Agents SDK, Mastra, and Google ADK (3 min): The 2025 "second-wave" of agent SDKs — OpenAI Agents SDK, Mastra (TypeScript-first), and Google ADK (Agent Development Kit) — converged on a similar shape: opinionated agent + handoff + guardrail primitives sitting between bare API calls and a full framework like LangGraph; a useful comparison if you're picking an SDK.
· Prompt and Context Caching (3 min): Prompt caching reuses computation for repeated prefixes — system prompts, long instructions, recently-seen documents — at 5–10× cost savings on cache-hit tokens; it is the single largest cost lever in any agent system, and harness-layer prompt structure determines whether you actually capture it.
· SWE-bench and Harness Leaderboards (3 min): SWE-bench is the dominant agent benchmark for software engineering tasks, and harness leaderboards (top scores published by ruflo, Aider, Devin, Cursor, OpenHands) are how the harness-layer competition is now measured — a 2026 frontier harness scoring 80%+ on SWE-bench Verified is roughly a year-over-year doubling of capability.
· The 75% Savings Claim (3 min): Ruflo's headline claim of "75% API cost savings vs. Claude Code direct" is plausible but conditional on workload — the savings come from prompt caching discipline + multi-provider routing + parallel tool calls + cheaper-model fallback; this concept audits the claim and shows where it does and doesn't hold.
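The caching and routing levers discussed above are easy to put numbers on. A back-of-envelope cost sketch in which all prices, the 10x cache-hit discount, and the routing rule are illustrative assumptions, not any provider's real rates:

```python
PRICE_PER_MTOK = {"small": 0.25, "large": 3.00}   # hypothetical $ per 1M input tokens
CACHE_DISCOUNT = 0.1                               # assume cache hits billed at 10%

def turn_cost(prefix_tokens, new_tokens, model, cache_hit):
    price = PRICE_PER_MTOK[model] / 1_000_000
    prefix_rate = price * CACHE_DISCOUNT if cache_hit else price
    return prefix_tokens * prefix_rate + new_tokens * price

def session_cost(turns):
    # Routing rule: classification turns go to the small model, the rest to the large.
    total = 0.0
    for kind, prefix_tokens, new_tokens, cache_hit in turns:
        model = "small" if kind == "route" else "large"
        total += turn_cost(prefix_tokens, new_tokens, model, cache_hit)
    return total
```

Run against a long-prefix workload, this toy model shows why the harness's prompt structure, not per-token price, dominates: a cache-hit turn with a 1M-token prefix costs a tenth of the identical cache-miss turn.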