One-Line Summary: Confidence intervals transform meaningless point estimates like "72% success rate" into informative statements like "72% +/- 4.2% (95% CI)," making uncertainty explicit and comparisons honest.
Prerequisites: sample-size-and-power-analysis.md, ../01-foundations-of-agent-evaluation/what-is-agent-evaluation.md
What Is a Confidence Interval for Agent Metrics?
Imagine a weather forecast that says "tomorrow's temperature will be 68 degrees." Now imagine one that says "68 degrees, give or take 3 degrees." The second forecast is far more useful because it communicates uncertainty. Agent evaluation faces the same problem: reporting "our agent scores 72% on SWE-bench" without a confidence interval is like the first forecast -- it sounds precise but conceals how much you actually know.
A confidence interval (CI) provides a range of plausible values for the true performance metric, given the observed data. A 95% CI means that if you repeated the entire evaluation procedure many times, 95% of the resulting intervals would contain the true parameter. For agent evaluation, this is critical because stochastic sampling (temperature > 0), environment variability, and limited task sets all inject noise into observed metrics.
The choice of CI method matters more than most practitioners realize. The standard normal approximation breaks down badly for proportions near 0 or 1 -- exactly where high-performing agents operate. Using the wrong interval can produce impossible results (like a lower bound below 0%) or intervals that are far too narrow, giving false confidence.
How It Works
The Wald (Normal Approximation) Interval
The simplest approach uses the Central Limit Theorem:
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where $\hat{p}$ is the observed success rate and $n$ is the number of trials. For example, with $\hat{p} = 0.72$ and $n = 100$:
$$0.72 \pm 1.96\sqrt{\frac{0.72 \times 0.28}{100}} = 0.72 \pm 0.088$$
This gives a 95% CI of $[0.632, 0.808]$. The Wald interval is easy to compute but has well-documented problems: it undercovers (actual coverage < 95%) for small $n$ and extreme $\hat{p}$, and it can produce bounds outside $[0, 1]$.
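The Wald computation is a few lines of Python (the function name is illustrative):

```python
import math

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wald (normal-approximation) CI for a binomial proportion."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Known flaw: these bounds can fall outside [0, 1].
    return (p_hat - half_width, p_hat + half_width)

lo, hi = wald_ci(72, 100)
print(f"[{lo:.3f}, {hi:.3f}]")  # -> [0.632, 0.808]
```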
The Wilson Score Interval
The Wilson interval inverts the score test and performs substantially better:
$$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$
For the same example ($\hat{p} = 0.72$, $n = 100$, $z = 1.96$):
$$\frac{0.72 + 0.0192 \pm 0.0901}{1.0384} \approx [0.625,\ 0.799]$$
The Wilson interval is always contained in $[0, 1]$, has better coverage properties for small samples, and handles extreme proportions gracefully. It should be the default for agent evaluation.
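A minimal sketch of the Wilson interval, following the formula above (names are illustrative):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score CI for a binomial proportion; always within [0, 1]."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (center - half_width, center + half_width)

print(wilson_ci(72, 100))  # roughly (0.625, 0.799), pulled toward 0.5 vs Wald
print(wilson_ci(0, 20))    # lower bound is 0; upper bound stays sensible
```

Note how the center $(\hat{p} + z^2/2n)/(1 + z^2/n)$ shrinks the estimate toward 0.5, which is exactly what rescues coverage at extreme proportions.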
The Clopper-Pearson Exact Interval
For conservative guarantees, the Clopper-Pearson interval uses the exact binomial distribution:
$$\left[\,B\!\left(\tfrac{\alpha}{2};\ x,\ n - x + 1\right),\ \ B\!\left(1 - \tfrac{\alpha}{2};\ x + 1,\ n - x\right)\right]$$
where $B(q;\ a, b)$ is the beta distribution quantile function and $x$ is the number of successes. This interval guarantees at least $1 - \alpha$ coverage for any true $p$, but it is conservative (often wider than necessary). Use it when you need guaranteed coverage -- for example, in safety-critical evaluations where undercoverage is unacceptable.
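Without a stats library, the Clopper-Pearson bounds can be recovered from the equivalent binomial-tail definition by bisection. This is a pure-stdlib sketch with illustrative names:

```python
from math import comb

def _tail_ge(x: int, n: int, p: float) -> float:
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def _tail_le(x: int, n: int, p: float) -> float:
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

def _solve_increasing(f, target: float) -> float:
    """Bisection on [0, 1] for an increasing f with f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(80):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) CI via the binomial tail definition."""
    # Lower bound: the p at which observing >= x successes has probability alpha/2.
    lower = 0.0 if x == 0 else _solve_increasing(lambda p: _tail_ge(x, n, p), alpha / 2)
    # Upper bound: the p at which observing <= x successes has probability alpha/2.
    # _tail_le decreases in p, so negate it to reuse the increasing solver.
    upper = 1.0 if x == n else _solve_increasing(lambda p: -_tail_le(x, n, p), -alpha / 2)
    return (lower, upper)

print(clopper_pearson(72, 100))  # roughly (0.621, 0.806) -- wider than Wald or Wilson
```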
Bootstrap Confidence Intervals for Complex Metrics
Many agent metrics are not simple proportions. Trajectory quality scores, cost efficiency, latency distributions, and composite metrics require more flexible methods. The bootstrap procedure:
- From your evaluation results, draw $B$ bootstrap samples (typically $B = 10{,}000$) of size $n$ with replacement.
- Compute the metric $\hat{\theta}^*_b$ for each bootstrap sample $b = 1, \ldots, B$.
- The percentile interval is $[\hat{\theta}^*_{(\alpha/2)},\ \hat{\theta}^*_{(1-\alpha/2)}]$, the empirical $\alpha/2$ and $1-\alpha/2$ quantiles of the bootstrap distribution.
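The percentile procedure above can be sketched as follows (the cost data and function names are hypothetical):

```python
import random
import statistics

def percentile_bootstrap_ci(data, metric, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an arbitrary metric of the sample."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(data)
    # Resample with replacement, compute the metric each time, and sort.
    boots = sorted(metric(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return (boots[lo_idx], boots[hi_idx])

# Hypothetical per-run cost data (USD) for 40 agent episodes.
costs = [0.8 + 0.05 * i for i in range(40)]
ci = percentile_bootstrap_ci(costs, statistics.fmean, n_boot=2000)
print(ci)  # brackets the sample mean of 1.775
```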
For improved accuracy, use the bias-corrected and accelerated (BCa) bootstrap:
$$[\hat{\theta}^*_{(\alpha_1)},\ \hat{\theta}^*_{(\alpha_2)}]$$
where $\alpha_1$ and $\alpha_2$ are adjusted quantiles accounting for bias $\hat{z}_0$ and acceleration $\hat{a}$:
$$\alpha_1 = \Phi\!\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}\,(\hat{z}_0 + z_{\alpha/2})}\right), \qquad \alpha_2 = \Phi\!\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}\,(\hat{z}_0 + z_{1-\alpha/2})}\right)$$
The BCa bootstrap is the recommended general-purpose method for non-standard agent metrics.
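A sketch of BCa under the formulas above, using only the Python standard library. Names and the example data are illustrative, and the metric is assumed to take a list:

```python
import random
import statistics
from statistics import NormalDist

def bca_bootstrap_ci(data, metric, n_boot=10_000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated (BCa) bootstrap CI (sketch)."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = metric(data)
    boots = sorted(metric(rng.choices(data, k=n)) for _ in range(n_boot))

    # Bias correction z0: where theta_hat sits in the bootstrap distribution.
    prop = sum(b < theta_hat for b in boots) / n_boot
    prop = min(max(prop, 1 / n_boot), 1 - 1 / n_boot)  # keep inv_cdf finite
    z0 = NormalDist().inv_cdf(prop)

    # Acceleration a from jackknife (leave-one-out) replicates.
    jack = [metric(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = statistics.fmean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted(z_alpha):  # alpha_1 / alpha_2 from the BCa formula
        return NormalDist().cdf(z0 + (z0 + z_alpha) / (1 - a * (z0 + z_alpha)))

    a1 = adjusted(NormalDist().inv_cdf(alpha / 2))
    a2 = adjusted(NormalDist().inv_cdf(1 - alpha / 2))
    lo_idx = min(max(int(a1 * n_boot), 0), n_boot - 1)
    hi_idx = min(max(int(a2 * n_boot), 0), n_boot - 1)
    return (boots[lo_idx], boots[hi_idx])

costs = [0.8 + 0.05 * i for i in range(40)]  # hypothetical per-run costs (USD)
ci = bca_bootstrap_ci(costs, statistics.fmean, n_boot=2000)
```

For a symmetric metric like the mean on symmetric data, BCa nearly matches the percentile interval; the correction matters for skewed metrics such as latency quantiles or cost ratios.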
Why It Matters
- Honest reporting: A 72% score from 50 runs (95% CI roughly $\pm 12$ percentage points) tells a fundamentally different story than 72% from 500 runs (roughly $\pm 4$ points). Without CIs, these are indistinguishable.
- Meaningful comparisons: Two agents scoring 72% and 68% look different, but if their 95% CIs overlap substantially, you cannot conclude either is better.
- Decision support: Deployment decisions should depend on whether the lower bound of the CI exceeds an acceptable threshold, not whether the point estimate does.
- Reproducibility: CIs make it immediately obvious when a result is based on too little data, incentivizing adequate sample sizes.
- Regulatory and publication standards: As agent evaluation matures, reporting standards increasingly require uncertainty quantification.
Key Technical Details
- Wilson vs Wald: Always prefer Wilson for binary metrics. The Wald interval can have actual coverage as low as 85% when the nominal level is 95%, especially for $\hat{p} < 0.1$ or $\hat{p} > 0.9$.
- Bootstrap sample count: Use $B = 10{,}000$ for publication-quality CIs. For quick iteration, $B = 1{,}000$ is acceptable.
- Simultaneous CIs: When reporting CIs for $k$ metrics simultaneously, apply the Bonferroni correction (use level $\alpha/k$ for each interval) or use Scheffe's method to maintain family-wise coverage.
- Clustered data: If runs within a task are correlated, inflate the standard error by the design effect: $SE_{\text{clustered}} = SE_{\text{iid}} \cdot \sqrt{\text{DEFF}}$, where $\text{DEFF} = 1 + (m - 1)\rho$ for $m$ runs per task with intraclass correlation $\rho$.
- Reporting format: Always report: metric name, point estimate, CI bounds, confidence level, sample size, and CI method. Example: "Success rate: 72.0% [67.4%, 76.2%] (95% Wilson CI, $n = 400$)."
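To make the design-effect adjustment concrete, here is a small sketch; the scenario numbers (100 tasks times 5 runs, $\rho = 0.3$) are illustrative:

```python
import math

def clustered_halfwidth(p_hat, n_total, runs_per_task, rho, z=1.96):
    """CI half-width with the design effect DEFF = 1 + (m - 1) * rho
    applied to the naive (i.i.d.) standard error."""
    se_iid = math.sqrt(p_hat * (1 - p_hat) / n_total)
    deff = 1 + (runs_per_task - 1) * rho
    return z * se_iid * math.sqrt(deff)

naive = clustered_halfwidth(0.72, 500, 1, 0.0)   # pretend all 500 runs independent
honest = clustered_halfwidth(0.72, 500, 5, 0.3)  # 100 tasks x 5 runs, rho = 0.3
print(f"{naive:.3f} vs {honest:.3f}")  # the honest interval is ~48% wider
```

The design effect shows why rerunning the same 100 tasks five times buys much less precision than evaluating 500 distinct tasks.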
Common Misconceptions
- "A 95% CI means there's a 95% probability the true value is in this interval." The true value is fixed; the interval is random. The correct interpretation is that 95% of intervals constructed this way contain the true value. This frequentist nuance matters when making decisions.
- "Overlapping confidence intervals mean no significant difference." Two 95% CIs can overlap and the difference can still be statistically significant at . The correct approach is to compute a CI for the difference directly: .
- "The normal approximation is fine for large samples." Even at , if , the Wald interval can have actual coverage below 90%. Always use Wilson or Clopper-Pearson for extreme proportions.
- "Bootstrap CIs are always more accurate." The basic percentile bootstrap can have poor coverage for small . Use BCa or studentized bootstrap for reliable results.
Connections to Other Concepts
- sample-size-and-power-analysis.md -- Sample size directly controls CI width; width shrinks in proportion to $1/\sqrt{n}$.
- effect-size-and-practical-significance.md -- CIs naturally encode practical significance: if the entire CI falls above a meaningful threshold, the effect is both statistically and practically significant.
- variance-decomposition.md -- Variance components determine which source of uncertainty dominates the CI width.
- regression-detection-statistics.md -- CIs for the difference between versions are the foundation of regression detection.
- ../08-evaluation-tooling-and-infrastructure/evaluation-reporting.md -- Proper CI reporting is a core component of evaluation reports.
Further Reading
- "Interval Estimation for a Binomial Proportion" -- Lawrence D. Brown, T. Tony Cai, Anirban DasGupta, 2001
- "Approximate Is Better than 'Exact' for Interval Estimation of Binomial Proportions" -- Alan Agresti, Brent A. Coull, 1998
- "Bootstrap Methods and Their Application" -- A.C. Davison, D.V. Hinkley, 1997
- "Better Approximate Confidence Intervals for a Binomial Parameter" -- Robert G. Newcombe, 1998