One-Line Summary: Evaluating an agent's plan — separately from evaluating its execution — lets you detect bad plans before they burn tokens, and lets you compare planning strategies; the harness usually exposes evaluation as a hook between plan generation and execution.
Prerequisites: Plan graphs vs plan strings, trajectory evaluation
What Is Plan Evaluation?
When an agent produces a plan, you have a chance to ask: is this plan any good? Before any tool is called, before any tokens are spent on execution, you can score the plan against criteria — does it cover the task, is it minimal, does it have obvious flaws, is each step's precondition met, is the cost estimate reasonable.
This is distinct from trajectory evaluation, which scores the whole rollout (plan + execution + outcome). Plan evaluation is upstream; it catches problems earlier.
How It Works
A plan evaluator is a function from (task, plan) → (score, feedback). Common implementations:
- Rule-based: Static checks — does the plan have at most N steps, does each step reference a real tool, are preconditions reachable.
- LLM-as-judge: A separate LLM scores the plan against a rubric. Slower and more expensive, but catches subtler issues.
- Historical comparison: Score the plan against past successful trajectories on similar tasks. Requires a trajectory store (see trajectory-learning.md).
- Self-critique: The same agent that produced the plan reviews it. Cheap; biased toward what the agent already did.
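To make the (task, plan) → (score, feedback) shape concrete, here is a minimal sketch of the rule-based variant. The `Plan` and `PlanStep` dataclasses, the step cap, and the quarter-point penalty are illustrative assumptions, not any framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    tool: str                                   # name of the tool this step calls
    preconditions: list[str] = field(default_factory=list)

@dataclass
class Plan:
    steps: list[PlanStep]

def rule_based_evaluate(plan: Plan, known_tools: set[str],
                        max_steps: int = 7) -> tuple[float, list[str]]:
    """Static checks only: step cap, tool existence, explicit preconditions.
    Returns (score in [0, 1], human-readable feedback)."""
    feedback: list[str] = []
    if len(plan.steps) > max_steps:
        feedback.append(f"{len(plan.steps)} steps exceeds the cap of {max_steps}")
    for i, step in enumerate(plan.steps):
        if step.tool not in known_tools:
            feedback.append(f"step {i} references unknown tool '{step.tool}'")
        if not step.preconditions:
            feedback.append(f"step {i} has no explicit preconditions")
    # Deduct a quarter point per finding, floored at zero.
    return max(0.0, 1.0 - 0.25 * len(feedback)), feedback
```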
The evaluator's verdict gates execution: high-quality plans proceed; low-quality plans trigger a re-plan or a user check-in.
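Wired into a harness, the gate is a short loop around the planner. This sketch reuses `rule_based_evaluate` from above; `generate_plan`, `execute`, and `ask_user` are hypothetical stand-ins for your planner, executor, and user check-in, and the threshold and retry budget are arbitrary choices.

```python
def plan_and_run(task: str, known_tools: set[str],
                 threshold: float = 0.75, max_replans: int = 2):
    """Gate execution on the evaluator's verdict, feeding the
    evaluator's feedback back into the planner on each retry."""
    feedback: list[str] = []
    for _ in range(max_replans + 1):
        plan = generate_plan(task, feedback)    # hypothetical planner call
        score, feedback = rule_based_evaluate(plan, known_tools)
        if score >= threshold:
            return execute(plan)                # hypothetical executor call
    return ask_user(task, feedback)             # hypothetical user check-in
```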
Why It Matters
Plans are flawed before execution more often than people expect. Catching obviously bad plans before execution is one of the highest-leverage interventions in agent systems — it saves tokens, latency, and user trust. It also gives you a measurable training target: if you log plans and their evaluations, you can study which planners produce evaluator-rejected plans most often and tune them.
Key Technical Details
- Evaluator as gate: The simplest deployment is "evaluator returns score; if below threshold, re-plan." Don't over-engineer beyond that.
- Rubrics matter more than the model: A clear rubric ("plans should have ≤ 7 steps; each step should name one tool; preconditions should be explicit") drives evaluation quality more than the evaluator model size.
- Cost vs. coverage: LLM-as-judge plan evaluation is cheap relative to executing a bad plan. Even at $0.10 per evaluation, it pays for itself if it catches one bad plan in ten and a wasted execution costs more than a dollar.
- Independence: The evaluator should not be the same agent as the planner, nor share context with it; otherwise it just rubber-stamps.
- Adversarial plans: A planner that learns the evaluator's preferences can produce plans that score well but execute poorly. Rotate evaluators or randomize rubric weights to mitigate.
- Fail-open vs. fail-closed: When the evaluator times out, do you proceed with the plan or pause? Default to fail-open with a logged warning unless the task is high-stakes.
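For the fail-open default in the last bullet, one minimal approach is a thread-based timeout around whatever evaluator you use. The wrapper below is a sketch, not a prescribed pattern; the 10-second budget is an arbitrary assumption, and `evaluate` is any (plan) → (score, feedback) callable.

```python
import concurrent.futures
import logging

logger = logging.getLogger(__name__)

def score_with_timeout(plan, evaluate, timeout_s: float = 10.0,
                       fail_open: bool = True) -> float:
    """Run evaluate(plan) with a deadline. On timeout, fail open
    (treat as passing, log a warning) or fail closed (treat as failing)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(evaluate, plan)
        score, _feedback = future.result(timeout=timeout_s)
        return score
    except concurrent.futures.TimeoutError:
        logger.warning("plan evaluator timed out after %.1fs; failing %s",
                       timeout_s, "open" if fail_open else "closed")
        return 1.0 if fail_open else 0.0
    finally:
        # Don't block on the stuck evaluation; the worker thread may
        # keep running in the background until it finishes.
        pool.shutdown(wait=False)
```

Setting `fail_open=False` per task is one way to implement the "unless the task is high-stakes" carve-out.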
How Harnesses & Frameworks Implement This
| Harness / Framework | Plan evaluation support |
|---|---|
| Claude Code | DIY — implement as a PreToolUse hook on plan-shaped tools |
| Claude Agent SDK | DIY — programmatic |
| ruflo | First-class — methodology plugins (e.g. ruflo-sparc) include plan evaluation |
| LangGraph | DIY — add an evaluator node before execution nodes |
| AutoGen | DIY |
| CrewAI | DIY |
| OpenAI Agents SDK | Guardrails can act as plan evaluators |
| Codex CLI | ✗ |
| Cursor | ✗ |
Connections to Other Concepts
- plan-graphs-vs-plan-strings.md — Graphs are easier to evaluate than strings.
- adaptive-replanning.md — Failed evaluation triggers replanning.
- trajectory-learning.md — Evaluation results train the planner.
- ../../ai-agent-evaluation/04-trajectory-and-process-analysis/trajectory-evaluation.md — Foundational coverage of execution-side evaluation.
Further Reading
- Bai et al. (Anthropic), "Constitutional AI: Harmlessness from AI Feedback" (2022) — A precursor to LLM-as-judge.