One-Line Summary: Power analysis determines how many evaluation runs you need to draw statistically valid conclusions about agent performance, balancing rigor against cost.

Prerequisites: ../01-foundations-of-agent-evaluation/what-is-agent-evaluation.md, confidence-intervals-for-agent-metrics.md

What Is Sample Size and Power Analysis?

Imagine you flip a coin 3 times and get 2 heads. Would you conclude the coin is biased? Probably not -- 3 flips is far too few to distinguish bias from luck. Agent evaluation faces the same problem: running an agent on 10 tasks and observing a 70% success rate tells you almost nothing about its true capability. Sample size and power analysis is the statistical framework for answering "how many runs do I actually need?"

Power analysis quantifies the relationship between four interconnected quantities: sample size ($n$), effect size (the difference you want to detect), significance level ($\alpha$, your false positive tolerance), and statistical power ($1 - \beta$, your probability of detecting a real effect). Fix any three, and the fourth is determined. In agent evaluation, we typically fix $\alpha = 0.05$, power $1 - \beta = 0.80$, and the minimum effect size we care about, then solve for the required sample size $n$.

The results are often sobering. Detecting a 5% difference between two agents with reasonable confidence requires hundreds to over a thousand evaluation runs per agent -- far more than the handful of examples many teams use. This creates a fundamental tension between statistical rigor and the practical cost of running evaluations, especially when each run involves expensive LLM API calls and complex environment setups.

How It Works

Power Analysis for Binary Outcomes

Most agent evaluations produce binary outcomes: the agent either completed the task or it didn't. Comparing two agents' success rates is a two-proportion z-test. The required sample size per group is:

$$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \,\bigl[p_1(1 - p_1) + p_2(1 - p_2)\bigr]}{(p_1 - p_2)^2}$$

where $z_{\alpha/2} = 1.96$ for $\alpha = 0.05$ and $z_{\beta} \approx 0.84$ for 80% power.

For the common scenario of comparing an agent at $p_1 = 0.70$ against one at $p_2 = 0.75$ (a 5 percentage-point difference):

$$n = \frac{(1.96 + 0.84)^2 \,[0.70 \times 0.30 + 0.75 \times 0.25]}{(0.05)^2} = \frac{7.84 \times 0.3975}{0.0025} \approx 1247$$

That is approximately 1,250 tasks per agent to detect a 5-point difference with 80% power.
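
This calculation is easy to script. Below is a minimal sketch in Python (standard library only); the function name and defaults are illustrative rather than any particular library's API:

```python
import math
from statistics import NormalDist

def required_n_per_group(p1: float, p2: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a two-proportion z-test.

    Implements n = (z_{alpha/2} + z_beta)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)

# The worked example, plus the extreme-proportion scenarios discussed below
for p1, p2 in [(0.70, 0.75), (0.50, 0.55), (0.90, 0.95), (0.10, 0.15)]:
    print(f"{p1:.0%} vs {p2:.0%}: n = {required_n_per_group(p1, p2)} per agent")
```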

Variance Near Extreme Proportions

The required sample size depends on where the proportions fall. The variance of a Bernoulli variable is $p(1-p)$, which is maximized at $p = 0.5$ and approaches zero as $p$ nears 0 or 1. This means:

  • Comparing agents at 50% vs 55%: ~1,560 tasks per agent
  • Comparing agents at 90% vs 95%: ~430 tasks per agent
  • Comparing agents at 10% vs 15%: ~680 tasks per agent

As benchmarks saturate (agents approaching 95%+ success), you actually need fewer samples to detect differences of the same absolute size -- a small silver lining.

Minimum Sample Sizes for Common Goals

| Goal | Minimum $n$ per agent | Rationale |
| --- | --- | --- |
| Rough confidence interval | ~30 | CLT approximation validity |
| Reliable agent comparison | 100+ | Detect only large (10-20 point) differences |
| Precise comparison (5 points) | 1,250+ | 80% power at $\alpha = 0.05$ |
| Near-ceiling discrimination | 400+ | Detect ~3-point differences near saturation (e.g., 96% vs 99%) |

The Cost Equation

Every evaluation run has a cost. At $\bar{c}$ dollars per run, the total evaluation budget for comparing two agents is:

$$C = 2 \cdot T \cdot k \cdot \bar{c}$$

where $T$ is the number of distinct tasks and $k$ is runs per task. For $T = 100$ tasks, $k = 30$ runs, and $\bar{c} = \$1$: $C = 2 \times 100 \times 30 \times 1 = \$6{,}000$ per pairwise comparison. Teams iterating daily face evaluation budgets that dwarf development costs.
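
As a sanity check on that arithmetic, the cost equation translates directly into code; the helper and its argument names below are illustrative:

```python
def evaluation_cost(num_tasks: int, runs_per_task: int, cost_per_run: float,
                    num_agents: int = 2) -> float:
    """Total budget C = num_agents * T * k * c_bar for one pairwise comparison."""
    return num_agents * num_tasks * runs_per_task * cost_per_run

# T = 100 tasks, k = 30 runs per task, $1 per run, two agents
print(f"${evaluation_cost(100, 30, 1.0):,.0f}")  # $6,000
```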

Practical Budget-Sample Trade-offs

The tension between statistical rigor and budget is the central challenge of evaluation design. Consider three realistic scenarios:

**Startup with limited budget (~$500):** At $1/run, you can afford 500 total runs. Spread across 50 tasks with 5 runs each per agent, you can detect only ~25 percentage-point differences. Strategy: focus on a narrow task set, use paired designs, accept lower power for rapid iteration.

**Mid-size team (~$5,000):** At $1/run across 100 tasks, you can run 25 trials per agent per task. This detects ~12 percentage-point differences with 80% power. Adequate for comparing meaningfully different agent architectures but insufficient for incremental prompt changes.

**Production evaluation pipeline ($50,000/quarter):** This budget supports rigorous evaluation at scale. Allocate 60% to routine regression testing (high frequency, moderate power) and 40% to deep evaluation studies (low frequency, high power with full variance decomposition).

The optimal allocation between the number of tasks $T$ and runs per task $k$ depends on the variance structure. If between-task variance dominates, increase $T$; if within-task variance (model sampling) dominates, increase $k$. See variance-decomposition.md for diagnosing which case applies.
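
The sketch below ties these scenarios together: it estimates the minimum detectable difference for a given allocation, using the common approximation $\delta \approx (z_{\alpha/2} + z_\beta)\sqrt{2\bar{p}(1-\bar{p})/n_{\text{eff}}}$ together with the design effect defined under Key Technical Details. The intra-task correlation of 0.5 in the example is an assumption for illustration:

```python
from statistics import NormalDist

def min_detectable_diff(runs_per_agent: int, runs_per_task: int, icc: float,
                        p_bar: float = 0.5, alpha: float = 0.05,
                        power: float = 0.80) -> float:
    """Approximate smallest success-rate gap detectable at the given power.

    Discounts correlated repeat runs on the same task via the design
    effect DEFF = 1 + (m - 1) * rho, so n_eff = n / DEFF.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n_eff = runs_per_agent / (1 + (runs_per_task - 1) * icc)
    return z * (2 * p_bar * (1 - p_bar) / n_eff) ** 0.5

# Startup scenario: 50 tasks x 5 runs = 250 runs per agent
print(f"{min_detectable_diff(250, runs_per_task=5, icc=0.5):.2f}")  # ~0.22
```

Under these assumptions, the startup's 250 runs per agent resolve only ~22-point gaps, in line with the ~25-point figure above.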

Why It Matters

  1. Prevents false conclusions: Without adequate sample sizes, teams routinely ship "improvements" that are just noise, or reject real improvements that happened to produce unlucky samples.
  2. Budget planning: Power analysis lets you estimate evaluation costs before running anything, enabling informed trade-offs between precision and budget.
  3. Benchmarking credibility: Published results without power analysis or confidence intervals are scientifically incomplete. Reviewers and practitioners increasingly demand sample size justification.
  4. Iterative development: Understanding the sample/power trade-off lets teams choose faster, lower-power tests for daily iteration and reserve high-power tests for release decisions.

Key Technical Details

  • For paired designs (same tasks, two agents), use McNemar's test: $n \approx \bigl(z_{\alpha/2}\sqrt{p_d} + z_{\beta}\sqrt{p_d - d^2}\bigr)^2 / d^2$, where $p_d = p_{01} + p_{10}$ is the discordant pair proportion and $d = p_{01} - p_{10}$ is the difference to detect. Paired designs can reduce $n$ by 40-60% (see the sketch after this list).
  • The continuity correction adds approximately $2/|p_1 - p_2|$ to the sample size, which matters for small $n$.
  • For multi-agent comparisons, apply Bonferroni correction: use $\alpha' = \alpha / \binom{k}{2}$ for all pairwise comparisons among $k$ agents, substantially increasing the required $n$.
  • Sequential testing (see regression-detection-statistics.md) can reduce average sample size by 30-50% when effects are large.
  • Cluster effects: if tasks within a domain are correlated, the effective sample size is reduced by the design effect $\text{DEFF} = 1 + (m - 1)\rho$, where $m$ is the cluster size and $\rho$ is the intra-cluster correlation.
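
The paired-design arithmetic from the first bullet can be sketched as follows; the discordant probabilities $p_{01}$ and $p_{10}$ are illustrative assumptions, and the formula is one standard approximation for McNemar sample size:

```python
from math import ceil, sqrt
from statistics import NormalDist

def mcnemar_pairs(p01: float, p10: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of task pairs for McNemar's test.

    p01, p10: probabilities of the two discordant outcomes
    (agent A fails where B succeeds, and vice versa).
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_d = p01 + p10            # discordant pair proportion
    d = p01 - p10              # difference in marginal success rates
    return ceil((z_a * sqrt(p_d) + z_b * sqrt(p_d - d ** 2)) ** 2 / d ** 2)

# A 5-point gap carried by 25% discordant pairs
print(mcnemar_pairs(p01=0.15, p10=0.10))  # 783 pairs
```

Here ~780 pairs replace the ~1,560 unpaired runs per agent needed for a comparable 5-point gap at mid-range success rates -- a roughly 50% reduction, consistent with the 40-60% figure above.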

Common Misconceptions

  • "30 runs is always enough." Thirty runs gives you a rough confidence interval for a single agent, but it is woefully insufficient for comparing two agents. Detecting a 5% difference requires an order of magnitude more data.
  • "If the p-value is 0.06, the result is not significant, so the agents are equal." Absence of evidence is not evidence of absence. A non-significant result with low power simply means the test was inconclusive. Always report power alongside p-values.
  • "We ran the benchmark once and got 72%, so our agent achieves 72%." A single run conflates agent capability with the specific random seed, API latency, and other stochastic factors. The true performance is a distribution, not a point.
  • "More tasks are always better than more runs per task." This depends on whether task variance or within-task variance dominates. Variance decomposition (see variance-decomposition.md) should guide allocation.

Connections to Other Concepts

  • confidence-intervals-for-agent-metrics.md -- CI width is directly determined by sample size; power analysis tells you the $n$ needed for a target CI width.
  • variance-decomposition.md -- Understanding where variance comes from informs whether to increase tasks, runs, or both.
  • effect-size-and-practical-significance.md -- Power analysis requires specifying a minimum effect size, which should reflect practical significance, not arbitrary convention.
  • regression-detection-statistics.md -- Sequential testing methods can achieve the same power with smaller expected sample sizes.
  • ../06-cost-quality-latency-tradeoffs/cost-of-evaluation.md -- The cost equation links statistical rigor directly to evaluation budget.

Further Reading

  • "Statistical Power Analysis for the Behavioral Sciences" -- Jacob Cohen, 1988
  • "The Design of Experiments" -- Ronald A. Fisher, 1935
  • "Sample Size Determination and Power" -- Thomas P. Ryan, 2013
  • "Power Analysis and Determination of Sample Size for Covariance Structure Modeling" -- Robert C. MacCallum, Michael W. Browne, Hazuki M. Sugawara, 1996