Course · 8 modules · 70 lessons · 221 min

Agent Harnesses & Orchestration

The harness layer above LLMs — Claude Agent SDK, Codex CLI, Cursor, ruflo, LangGraph, AutoGen, CrewAI, and OpenAI Agents SDK compared concept-by-concept. Topologies, consensus, federation, planning, and the orchestration plumbing that turns models into systems.

The Harness Layer
· Claude Agent SDK Overview (3 min): The Claude Agent SDK is Anthropic's official toolkit for building harnesses (or harness-shaped applications) on top of Claude — it is the SDK that Claude Code itself is built on, exposing primitives for agent loops, tools, hooks, sub-agents, and MCP.
· Claude Code as Harness (4 min): Claude Code is Anthropic's official terminal harness — a CLI that wraps Claude with a programmable loop, hooks, sub-agents, slash commands, skills, MCP servers, and permission scoping, used in this course as the reference harness for examples and exercises.
· Codex CLI and Cursor as Harnesses (3 min): Codex CLI is OpenAI's terminal coding harness — the OpenAI counterpart to Claude Code — while Cursor is the dominant IDE-coding harness; together they bracket the design space of single-developer agentic coding tools.
· Harness vs. Framework vs. SDK (5 min): A *harness* is a deployed product that runs models for you (Claude Code, Cursor); a *framework* is a library you compose into your own application (LangGraph, AutoGen); an *SDK* is the toolkit for building either (Claude Agent SDK, OpenAI Agents SDK) — conflating them is the single most common error in 2026 agent infrastructure conversations.
· Harness vs. Orchestration Framework (3 min): Within the *harness* category there is a useful sub-distinction between *single-agent harnesses* (Claude Code, Codex CLI, Cursor) and *orchestration frameworks* / *orchestration platforms* (ruflo, OpenHands, AutoGPT-X) — the latter add multi-agent topology, swarms, federation, and autonomous loops on top of the harness loop.
· Ruflo Architecture Tour (3 min): Ruflo (formerly claude-flow) is the most-adopted open-source multi-agent orchestration platform of 2026; it layers on top of Claude Code with 100+ specialized agents, 314 MCP tools, 27 hooks, 32 plugins, queen-led/mesh/adaptive topologies, AgentDB+ReasoningBank memory, federated zero-trust execution, and a SONA-based learning loop.
· The 2026 Harness Landscape (3 min): As of mid-2026 the agent-harness market has split into roughly four categories — coding-IDE harnesses, terminal coding harnesses, orchestration platforms, and headless/agentic-OS harnesses — each represented by 2–4 dominant products with overlapping but distinct positioning.
· What Is an AI Harness? (11 min): An AI harness is the orchestration layer that wraps a language model with the loop, tools, memory, permissions, and lifecycle hooks needed to turn raw model outputs into a working agentic system — it is what you actually deploy, not the model itself.
· Why the Harness Is the Product (3 min): As frontier models commoditize within a benchmark point of each other, the harness — not the model — is what users adopt, customize, get locked into, and pay for; the harness layer captures most of the durable value in the agent economy.
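The core loop that "What Is an AI Harness?" describes, prompt the model, dispatch its tool calls, feed results back, stop on an answer or a budget, can be sketched in a few lines. Everything here (`fake_model`, the `read_file` tool, the message shape) is a hypothetical stand-in for illustration, not any real SDK's API:

```python
def fake_model(messages):
    # Stand-in for an LLM call: asks for one tool, then answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "read_file", "args": {"path": "README"}}
    return {"answer": "done"}

TOOLS = {"read_file": lambda path: f"contents of {path}"}

def harness_loop(user_prompt, model=fake_model, max_steps=5):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):            # the harness, not the model, enforces a budget
        reply = model(messages)
        if "answer" in reply:             # model signals completion
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])   # tool dispatch
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted")
```

The point of the sketch is the division of labor: the model only proposes; the loop, the tool registry, and the termination rule live in the harness.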
Harness Primitives
· Agent Definitions and Personas (4 min): An *agent definition* is the file that declares a sub-agent's identity (system prompt + tools + model + termination) and makes it reusable across sessions; the *persona* is the part of that file that captures voice, role, and decision-making style — together they turn ad-hoc role prompts into versioned, composable artifacts.
· Hooks and Lifecycle Events (3 min): Hooks are user-defined scripts that fire on harness lifecycle events — before a tool runs, after it returns, when a session starts, when the agent stops — letting you add policy, logging, validation, or transformation without forking the harness.
· MCP as the Universal Tool Bus (4 min): The Model Context Protocol (MCP) is the cross-harness tool standard — a single MCP server runs identically inside Claude Code, Cursor, ruflo, Codex CLI, Zed, and Continue, which is why the same `github` or `postgres` server installation works everywhere and why MCP, not any harness's native tool format, became the lingua franca of harness extensions.
· Permission and Tool Scoping Primitives (3 min): A harness's permission system — which tools a given agent can use, when they require user confirmation, and which paths/commands are off-limits — is enforced at the harness layer (not the model layer) and is the most important security primitive for any agentic deployment.
· Plugin and Marketplace Systems (4 min): A harness plugin is a packaged directory of extensions (sub-agents, hooks, slash commands, skills, MCP servers) that can be installed into a harness as a unit; a marketplace is the discovery layer that turns plugins into a distributed ecosystem — ruflo's marketplace and Claude Code's plugin system are the reference implementations in 2026.
· Settings and Configuration Files (3 min): A harness's configuration files (`settings.json`, `CLAUDE.md`, `.cursorrules`, `.ruflo/config.toml`) are its public API — the user-editable contract through which extensions, permissions, hooks, and memory are declared; their format and merge semantics matter as much as any code in the harness.
· Skills vs. Tools (3 min): Tools are individual callable functions the model invokes by name (`read_file`, `run_tests`); skills are higher-level capabilities the model opts into mid-conversation that bundle a system prompt, instructions, and a curated set of tools — the skill is the unit a model decides to *adopt*; the tool is the unit it *calls*.
· Slash Commands (3 min): Slash commands are user-typed shortcuts (`/review`, `/test`, `/explain`) that inject a parameterized prompt or invoke a workflow inside an active harness session — they are the keyboard-first surface for harness-extension UX, sitting alongside hooks (system-driven) and tools (model-driven).
· Sub-Agents as Primitives (3 min): A *sub-agent* is a full agent — its own context window, system prompt, scoped tool registry, and termination condition — that the main agent can spawn for a specific task; sub-agents differ from "role prompts" precisely because of that isolation, and treating them as the same is a common source of multi-agent bugs.
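Two of the primitives above, hooks and permission scoping, combine naturally: a pre-tool hook that vetoes out-of-scope tool calls before the harness executes them. A minimal sketch with made-up names (`pre_tool_hook`, `dispatch`, the blocked-path list), not any harness's actual hook API:

```python
BLOCKED_PATHS = ("/etc", "/root")   # illustrative permission scope

def pre_tool_hook(tool_name, args):
    """Fires before every tool call; returns (allowed, reason)."""
    path = args.get("path", "")
    if tool_name == "write_file" and any(path.startswith(p) for p in BLOCKED_PATHS):
        return False, f"write to {path} is outside the permitted scope"
    return True, "ok"

def dispatch(tool_name, args, tools):
    # The harness, not the model, runs the hook and enforces its verdict.
    allowed, reason = pre_tool_hook(tool_name, args)
    if not allowed:
        raise PermissionError(reason)
    return tools[tool_name](**args)
```

Because the check runs in the harness layer, a prompt-injected model request to write `/etc/passwd` never reaches the tool at all.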
Topologies And Coordination
· Adaptive Topology Switching (3 min): An adaptive topology switches between queen-led, mesh, hive-mind, and other shapes at runtime based on workload signals (task complexity, agent count, latency, cost) — the most sophisticated coordination pattern, exemplified by ruflo's adaptive mode, with significant complexity cost.
· Conversational Orchestration (3 min): Conversational orchestration — the AutoGen pattern — coordinates multiple agents through a multi-turn dialogue rather than dispatch-and-return: agents take turns speaking in a shared transcript, with a moderator deciding who goes next, and agreement emerges from the conversation itself.
· Hive Mind Pattern (3 min): A hive mind is a topology where many simple agents share a common memory store and produce emergent behavior that no individual agent encodes — closer to swarm intelligence than to a structured organization, useful when the problem benefits from many independent partial solutions that combine.
· Mesh Topology (3 min): A mesh topology lets every agent talk to every other agent directly, with no central coordinator — useful when peers genuinely need to negotiate, but expensive in tokens and hard to debug, so it is rarely the right default.
· Queen-Led Hierarchy (3 min): A queen-led topology has a single high-authority "queen" agent that allocates tasks to a pool of workers, arbitrates conflicts, and decides when work is done — ruflo's flagship topology and the most token-efficient way to coordinate 5+ agents on a complex task.
· Role-Based Orchestration (3 min): Role-based orchestration — popularized by CrewAI — assigns work by *role* (researcher, writer, editor) rather than by topology shape, with each role's persona, tools, and termination condition baked into a reusable definition; the topology emerges from how the roles are wired together.
· Supervisor Pattern Deep Dive (3 min): The supervisor pattern is the framework-vocabulary cousin of queen-led: one supervisor agent routes tasks to specialist agents and gathers results — it is the strong default recommended by Anthropic's "Building Effective Agents," and the topology you should pick when in doubt.
· Topology as a Design Decision (3 min): The shape of how agents connect — single, supervisor-led, mesh, hive mind, queen-led, or adaptive — is a deliberate design decision with concrete cost, latency, and reliability consequences, not an emergent property of running multiple agents.
· Topology Selection Decision Tree (3 min): A practical decision tree for picking a topology: start with single-agent, escalate to supervisor / queen-led only when single-agent demonstrably falls short, escalate beyond that only for specific patterns (mesh for negotiation, hive-mind for exploration, conversational for discussion, federation for cross-trust).
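The supervisor pattern, the recommended default above, reduces to a small dispatch-and-return loop. A toy sketch in which a hard-coded `route` function and two lambda workers stand in for the supervisor's usually model-driven routing and for real specialist sub-agents:

```python
WORKERS = {
    "code": lambda task: f"patch for: {task}",   # stand-in specialist agents
    "docs": lambda task: f"docs for: {task}",
}

def route(task):
    # Stand-in for the supervisor's routing decision (normally a model call).
    return "docs" if "explain" in task else "code"

def supervisor(tasks):
    results = {}
    for task in tasks:
        results[task] = WORKERS[route(task)](task)   # dispatch-and-return, no peer chatter
    return results
```

Note the contrast with mesh or conversational topologies: workers never talk to each other, which is exactly what keeps the pattern cheap and debuggable.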
Planning And Replanning
· A* Planner for Agents (3 min): A* is the classical heuristic search algorithm at the heart of GOAP and most structured agent planners — it finds the lowest-cost action sequence from current state to goal state by expanding nodes in order of *cost-so-far + estimated-cost-to-goal*, and it is the workhorse of any harness that does plan-shaped (rather than chain-of-thought-shaped) planning.
· Adaptive Replanning (3 min): Adaptive replanning is the discipline of detecting when the current plan no longer fits reality (a tool failed, a precondition was violated, a result surprised the agent) and rebuilding a new plan from the post-divergence state — every long-horizon agent system needs it; the question is how the harness expresses it.
· Goal-Oriented Action Planning (GOAP) (3 min): GOAP is a planning technique borrowed from game AI where the agent searches a graph of available actions for a sequence whose preconditions and effects connect the current world state to a goal — used in modern harnesses as a structured alternative to free-form chain-of-thought planning.
· Multi-Step Plan Evaluation (3 min): Evaluating an agent's *plan* — separately from evaluating its execution — lets you detect bad plans before they burn tokens, and lets you compare planning strategies; the harness usually exposes evaluation as a hook between plan generation and execution.
· Plan-Driven vs. Reactive Harnesses (3 min): Plan-driven harnesses (ruflo, LangGraph) build a structured plan upfront and execute against it; reactive harnesses (Cursor, Codex CLI in default mode) decide each next step based on what just happened — both are valid; the choice is mostly about task horizon and the cost of upfront planning.
· Plan Graphs vs. Plan Strings (3 min): A *plan string* is what an LLM emits when you ask it to "plan first" — a numbered list embedded in chain-of-thought; a *plan graph* is a structured, typed representation of the plan that the harness can inspect, verify, and replay — graphs are dramatically more reliable for non-trivial tasks, at the cost of more upfront engineering.
· Plan Rollback and Checkpointing (3 min): Rollback is the harness's ability to undo actions taken by an agent (file edits, branch creates, tool side effects) when a plan fails or is replanned; checkpointing is the snapshotting that makes rollback possible — together they are the difference between a recoverable agent and a destructive one.
· Speculative Planning and Branching (3 min): Speculative planning explores multiple candidate plans in parallel — picking the best one only after partial execution — at higher token cost in exchange for lower wall-clock latency and better outcomes on hard tasks; closer to chess-engine search than to typical LLM planning.
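The A*/GOAP planning described above can be made concrete: search a graph of actions with preconditions and effects, expanding nodes in order of cost-so-far plus estimated-cost-to-goal. The three actions below are a toy example, not drawn from any real harness:

```python
import heapq
from itertools import count

ACTIONS = [
    # (name, cost, preconditions, effects) over sets of world facts
    ("read_code",   1, set(),             {"code_read"}),
    ("write_tests", 2, {"code_read"},     {"tests_written"}),
    ("run_tests",   1, {"tests_written"}, {"tests_pass"}),
]

def plan(start, goal):
    h = lambda s: len(goal - s)          # admissible heuristic: missing goal facts
    tie = count()                        # tiebreaker so the heap never compares sets
    frontier = [(h(start), 0, next(tie), frozenset(start), [])]
    seen = set()
    while frontier:
        _, g, _, state, steps = heapq.heappop(frontier)
        if goal <= state:
            return steps                 # cheapest action sequence reaching the goal
        if state in seen:
            continue
        seen.add(state)
        for name, cost, pre, eff in ACTIONS:
            if pre <= state:             # action applicable in this state
                nxt = state | eff
                heapq.heappush(frontier,
                    (g + cost + h(nxt), g + cost, next(tie), nxt, steps + [name]))
    return None                          # goal unreachable from start
```

The same search skeleton underlies plan graphs generally: because the plan is typed data rather than a numbered list in chain-of-thought, the harness can verify preconditions before executing and replan from any intermediate state.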
Memory And Learning
· AgentDB and Vector Stores in Harnesses (3 min): AgentDB (ruflo's purpose-built vector database) and vector stores generally are the harness's substrate for semantic recall — embeddings of past trajectories, code snippets, documents, and decisions are kept in a queryable index so the agent can retrieve relevant memories on demand.
· Cross-Session Memory Strategies (3 min): Cross-session memory strategies decide what an agent remembers between conversations — the durable artifacts (configuration files, summaries, trajectories, adapters) and the policies for writing, retrieving, and aging them; this is one of the highest-leverage UX dimensions of any harness.
· Harness-Owned Memory (2 min): Durable agent memory — across turns, sessions, machines, and users — is owned by the harness, not the model; this is one of the harness's load-bearing responsibilities and a major axis on which harnesses differentiate.
· HNSW for Agent Recall (3 min): HNSW (Hierarchical Navigable Small World) is the dominant approximate-nearest-neighbor index used by agent vector stores — it is the data structure underneath AgentDB, Pinecone, Qdrant, Weaviate, and most production memory layers, and understanding its trade-offs explains a lot about why agent recall feels the way it does.
· Memory Portability Across Harnesses (3 min): Memory portability — whether the artifacts you've built up in one harness work in another — is partial in 2026: configuration files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`) are convergent enough to copy-with-edits; vector stores and trajectory stores are mostly per-harness; adapters are model-specific; expect a portability gradient, not a clean abstraction.
· Micro-LoRA Adapters at the Harness Layer (3 min): Micro-LoRA adapters are small, project-scoped low-rank fine-tunes (typically <50MB) that the harness can load on top of a base model to bias it toward the project's conventions, vocabulary, and successful trajectories — emerging in 2026 as a way to give agents a kind of parametric memory without the cost of full fine-tuning.
· ReasoningBank (3 min): ReasoningBank is ruflo's named pattern for storing whole *trajectories* — the sequence of (state, decision, outcome) tuples an agent produced — as memory the system can replay or learn from; it is a specialized vector store optimized for trajectory-shaped data and a key driver of ruflo's claimed self-learning behavior.
· SONA: Self-Learning Neural Patterns (3 min): SONA is ruflo's pattern-matching layer that learns *which strategies* tend to succeed for *which task signatures*, sitting one level above ReasoningBank — instead of replaying trajectories verbatim, SONA distills them into reusable patterns that bias the agent's planner toward known-good moves.
· Trajectory Learning (3 min): Trajectory learning is the family of techniques that learn from full agent rollouts (state-action-outcome sequences) rather than from isolated examples — it includes simple replay (store trajectories, retrieve at run time) and stronger forms (parametric updates via fine-tuning or LoRA on successful trajectories).
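The semantic-recall substrate these lessons describe boils down to nearest-neighbor search over embeddings. A toy sketch using hand-picked 3-dimensional vectors and brute-force cosine similarity; real memory layers use learned embeddings with hundreds of dimensions and an approximate index such as HNSW rather than an exhaustive scan:

```python
import math

MEMORIES = {
    # memory text -> toy embedding (real systems use a learned embedding model)
    "use pytest for this repo":          [0.9, 0.1, 0.0],
    "deploys run via GitHub Actions":    [0.1, 0.9, 0.0],
    "secrets live in the vault service": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def recall(query_vec, k=1):
    # Rank all stored memories by similarity to the query embedding.
    ranked = sorted(MEMORIES, key=lambda m: cosine(query_vec, MEMORIES[m]), reverse=True)
    return ranked[:k]
```

Everything above HNSW in the stack, ReasoningBank's trajectory recall included, is a specialization of this retrieve-by-similarity loop to differently shaped records.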
Consensus And Federation
· Behavioral Trust Scoring (3 min): Behavioral trust scoring assigns each federated peer a reputation score that updates based on observed behavior (latency, accuracy, protocol compliance, malicious actions detected) — and uses that score to gate privileges; cryptographic identity proves *who*, behavioral trust proves *whether they should be allowed*.
· Byzantine Fault-Tolerant Agents (3 min): Byzantine fault-tolerant (BFT) protocols handle the case where peers may not just fail but actively misbehave — returning wrong data, breaking the protocol, colluding — with the cost of needing 3f+1 peers to tolerate f bad ones; for federated agent systems with peers from untrusted parties, BFT is the right correctness model.
· Consensus in Multi-Agent Systems (2 min): Consensus protocols — Raft, Byzantine, gossip — are how multiple agents agree on state, decisions, or outputs in the presence of disagreement, latency, or untrusted peers, and they are increasingly first-class primitives in modern multi-agent harnesses.
· Cross-Machine Agent Federation (3 min): Federation lets agents on different machines (and sometimes different organizations) collaborate on tasks while preserving each side's privacy, trust assumptions, and resource budgets — exemplified by ruflo's federation mode, which combines mTLS for transport, ed25519 for identity, gossip for membership, and Raft/BFT for shared decisions.
· Gossip Protocols for Agents (3 min): Gossip protocols spread information probabilistically — each peer periodically picks a few random peers and exchanges state with them, converging the cluster toward a shared view over time without any leader; for large agent populations where eventual consistency is acceptable, gossip is the right scaling strategy.
· mTLS and ed25519 for Agent Trust (3 min): Mutual TLS (both sides authenticate via certificates) and ed25519 message signatures (compact, fast, modern) are the cryptographic substrate of federated agent systems — they are how a remote agent proves "I am who I say I am" before any meaningful interaction begins.
· PII Gating and AIDefence (3 min): PII gating is the harness-layer scrubbing of personally identifiable information (and secrets, credentials, sensitive metadata) from data flowing across trust boundaries; ruflo's `AIDefence` plugin is the reference implementation, identifying 14+ classes of sensitive data and either redacting, blocking, or alerting based on configured policy.
· Prompt Injection Defense in Harnesses (3 min): Prompt injection — adversarial text embedded in retrieved content, tool outputs, files, or messages that hijacks the agent's behavior — is defended at the harness layer through a defense-in-depth stack: input sanitization, content provenance tracking, tool permission scoping, hook-based blocking, and behavioral monitoring.
· Raft for Agents (3 min): Raft is a distributed-consensus protocol that elects a leader from a peer group and serializes all decisions through that leader, with a clean recovery story when the leader fails — applied to agent systems, Raft gives a peer group a way to agree on shared state (a plan, a memory entry, a verdict) without trusting any single agent permanently.
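A gossip round, as described above, is short enough to sketch: each peer exchanges state with a few random peers, and repeated rounds converge the cluster toward a shared view without any leader. The version-numbered dict used as state here is an illustrative stand-in for real membership or memory data:

```python
import random

def merge(a, b):
    # Last-writer-wins per key, by version number.
    out = dict(a)
    for key, (ver, val) in b.items():
        if key not in out or out[key][0] < ver:
            out[key] = (ver, val)
    return out

def gossip_round(peers, fanout=2, rng=random):
    # Every peer exchanges state with `fanout` random other peers (push-pull).
    for i in range(len(peers)):
        others = [p in range(0) or p for p in range(len(peers)) if p != i]
        for j in rng.sample(others, fanout):
            merged = merge(peers[i], peers[j])
            peers[i] = merged
            peers[j] = dict(merged)   # both sides leave the exchange in sync
```

Each round the set of peers holding a given fact roughly multiplies, which is why gossip scales to large agent populations when eventual (rather than immediate) consistency is acceptable.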
Background Workers And Autopilots
· Audit and Optimize Workers (3 min): Audit workers continuously inspect recent agent activity (commits, edits, decisions) for regressions, anti-patterns, or risks; optimize workers proactively rewrite code, prompts, or configurations toward measured improvements — together they form the most-cited concrete example of the background-worker pattern.
· Autopilot Modes (3 min): Autopilot modes let the harness run an agent without per-action user confirmation — bounded by a budget (tokens, time, steps), gated by permission scopes, monitored by background workers, and ended by an explicit checkpoint where the user reviews — they are the UX surface that makes long-horizon agentic work practical.
· The Background Worker Pattern (2 min): Background workers are agents the harness runs *between* user turns — auditing recent changes, optimizing code, looking for test gaps, refreshing memory — without requiring the user to ask, and they are one of the most important emerging patterns in 2026 harnesses.
· Continuous Execution Loops (3 min): A continuous-execution loop runs an agent indefinitely against a stream of tasks, events, or goals — distinct from a "session" that has a start and end — and is the runtime model that supports background workers, autopilot, federated agents, and always-on agentic services.
· Event-Driven Harness Architectures (3 min): An event-driven harness reacts to events — file changes, GitHub webhooks, build completions, schedule triggers — by invoking the agent loop without a user typing anything; this architecture turns a user-driven harness into an autonomous service and is the substrate for background workers, autopilot, and federated coordination.
· Methodology as Plugin: ADR and DDD (3 min): Architecture Decision Records (ADR) and Domain-Driven Design (DDD) are the two most-cited "discipline" methodologies in software engineering; ruflo packages each as a plugin (`ruflo-adr`, `ruflo-ddd`) so the discipline becomes a slash command rather than a team practice.
· Methodology as Plugin: SPARC (3 min): SPARC (Specification, Pseudocode, Architecture, Refinement, Code) is an agent-driven engineering methodology packaged as a ruflo plugin (`ruflo-sparc`) — it is the cleanest example of how a software methodology can be encoded as a multi-step agent workflow, not just adopted as a habit.
· Testgap and Coverage Workers (3 min): A testgap worker continuously identifies code without test coverage and proposes (or generates) tests; coverage workers track what is and isn't covered, surface deltas after each session, and prevent slow erosion of test quality — among the highest-leverage background workers because the work they do is something humans skip under time pressure.
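The event-driven worker pattern above can be sketched as a queue drain: events arrive from file watchers, webhooks, or schedulers, and a budget-bounded worker invokes an agent per event with no user in the loop. `audit_agent`, the handler table, and the event shapes are hypothetical placeholders, not a real harness API:

```python
import queue

def audit_agent(event):
    # Stand-in for a real agent invocation (e.g. "inspect this change").
    return f"audited {event['payload']}"

HANDLERS = {"file_changed": audit_agent, "webhook": audit_agent}

def run_worker(events, budget=10):
    # Drain up to `budget` events; the budget plays the role of an autopilot's step bound.
    results = []
    while budget > 0 and not events.empty():
        ev = events.get_nowait()
        results.append(HANDLERS[ev["type"]](ev))
        budget -= 1
    return results
```

Swap the bounded drain for an infinite blocking loop and the same skeleton becomes the continuous-execution runtime the lessons above describe.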
Harness Economics And Comparison
· Choosing Your Harness Stack (4 min): The capstone decision: pick a harness (interactive surface), decide whether you need an orchestration platform on top (multi-agent / autopilot), pick an SDK if you're building rather than using, and lean on MCP and configuration files to keep the choice reversible — most of the cost of getting it wrong is portability cost, which is partly mitigable.
· Claude Code vs. Codex CLI vs. Cursor (3 min): Side-by-side comparison of the three dominant single-developer coding harnesses in 2026 — Claude Code (terminal-first, hooks-rich, sub-agent-capable), Codex CLI (terminal-first OpenAI counterpart, simpler primitives), Cursor (IDE-tight, agent-mode autopilot, IDE-shaped extensibility).
· Harness Cost Models (2 min): A harness's cost is dominated not by per-token model price but by how often it calls the model, how aggressively it caches the prefix, when it falls back to a cheaper model, and how many sub-agents it parallelizes — these are harness-level decisions, not model-level ones.
· LangGraph vs. AutoGen vs. CrewAI (3 min): Side-by-side comparison of the three dominant agent frameworks in 2026 — LangGraph (graph-based, explicit state, production-leaning), AutoGen (conversational multi-agent, dialogue-centric), CrewAI (role-based, opinionated, approachable) — each shines for different problem shapes and team backgrounds.
· Model Routing in Harnesses (3 min): Model routing is the harness-layer decision of which model handles which turn — a small fast model for routing/classification, a large smart model for hard reasoning, a code-tuned model for coding subtasks; routing is the second-largest cost lever after caching, and a major source of harness differentiation.
· OpenAI Agents SDK, Mastra, and Google ADK (3 min): The 2025 "second-wave" of agent SDKs — OpenAI Agents SDK, Mastra (TypeScript-first), and Google ADK (Agent Development Kit) — converged on a similar shape: opinionated agent + handoff + guardrail primitives sitting between bare API calls and a full framework like LangGraph; a useful comparison if you're picking an SDK.
· Prompt and Context Caching (3 min): Prompt caching reuses computation for repeated prefixes — system prompts, long instructions, recently-seen documents — at 5–10× cost savings on cache-hit tokens; it is the single largest cost lever in any agent system, and harness-layer prompt structure determines whether you actually capture it.
· SWE-bench and Harness Leaderboards (3 min): SWE-bench is the dominant agent benchmark for software engineering tasks, and harness leaderboards (top scores published by ruflo, Aider, Devin, Cursor, OpenHands) are how the harness-layer competition is now measured — a 2026 frontier harness scoring 80%+ on SWE-bench Verified is roughly a year-over-year doubling of capability.
· The 75% Savings Claim (3 min): Ruflo's headline claim of "75% API cost savings vs. Claude Code direct" is plausible but conditional on workload — the savings come from prompt caching discipline + multi-provider routing + parallel tool calls + cheaper-model fallback; this concept audits the claim and shows where it does and doesn't hold.
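The caching and routing levers discussed above are easy to put numbers on. A back-of-envelope cost sketch in which all prices, the 10x cache-hit discount, and the routing rule are illustrative assumptions, not any provider's real rates:

```python
PRICE_PER_MTOK = {"small": 0.25, "large": 3.00}   # hypothetical $ per 1M input tokens
CACHE_DISCOUNT = 0.1                               # assume cache hits billed at 10%

def turn_cost(prefix_tokens, new_tokens, model, cache_hit):
    price = PRICE_PER_MTOK[model] / 1_000_000
    prefix_rate = price * CACHE_DISCOUNT if cache_hit else price
    return prefix_tokens * prefix_rate + new_tokens * price

def session_cost(turns):
    # Routing rule: classification turns go to the small model, the rest to the large.
    total = 0.0
    for kind, prefix_tokens, new_tokens, cache_hit in turns:
        model = "small" if kind == "route" else "large"
        total += turn_cost(prefix_tokens, new_tokens, model, cache_hit)
    return total
```

Run against a long-prefix workload, this toy model shows why the harness's prompt structure, not per-token price, dominates: a cache-hit turn with a 1M-token prefix costs a tenth of the identical cache-miss turn.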