One-Line Summary: Statistical significance tells you whether a difference is real; effect size and practical significance tell you whether it matters -- a distinction that prevents wasted deployments and missed opportunities.

Prerequisites: sample-size-and-power-analysis.md, confidence-intervals-for-agent-metrics.md

What Is Effect Size and Practical Significance?

Imagine a pharmaceutical trial finds that a new drug lowers blood pressure by 0.5 mmHg with $p < 0.05$. The effect is "statistically significant" but clinically meaningless -- no doctor would prescribe it. Conversely, a trial finding a 15 mmHg reduction with $p > 0.05$ (not significant) should not be dismissed; the sample was probably just too small. This distinction between statistical and practical significance is one of the most important -- and most frequently ignored -- concepts in agent evaluation.

In the agent evaluation context, practical significance asks: "Given the costs of deploying this new agent version (re-testing, migration, risk), does the observed improvement justify the effort?" A 2% improvement in benchmark accuracy that is statistically significant ($p < 0.05$) may not warrant the operational overhead of a deployment. Meanwhile, a 10% improvement that fails to reach significance ($p > 0.05$) because of limited evaluation budget may represent a genuinely important advance that deserves further investigation.

Effect size measures quantify the magnitude of a difference on a standardized scale, independent of sample size. They answer "how big is the difference?" rather than "is there a difference?" -- and for decision-making, magnitude almost always matters more than mere existence.

How It Works

Cohen's h for Proportion Differences

When comparing binary success rates -- the most common agent evaluation setting -- Cohen's $h$ measures effect size on the arcsine-transformed scale:

$$h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$$

This transformation stabilizes variance across the range of proportions. Conventional benchmarks:

| $h$ | Interpretation | Example (from a 70% baseline) |
|------|----------------|-------------------------------|
| 0.20 | Small effect | 70% vs ~79% |
| 0.50 | Medium effect | 70% vs ~90% |
| 0.80 | Large effect | 70% vs ~97% |

For example, comparing a baseline agent at $p_1 = 0.70$ against a candidate at $p_2 = 0.75$:

$$h = 2\arcsin\sqrt{0.75} - 2\arcsin\sqrt{0.70} = 2.094 - 1.982 \approx 0.11$$

This is a small effect -- detectable only with large samples and unlikely to be practically important in most contexts.
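
A minimal sketch of this calculation in Python (the function name and example values are illustrative, not taken from any particular library):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions."""
    phi1 = 2 * math.asin(math.sqrt(p1))  # arcsine (variance-stabilizing) transform
    phi2 = 2 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2)

# Baseline agent at 70% success vs. a candidate at 75%: a small effect
print(round(cohens_h(0.70, 0.75), 3))  # 0.112
```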

Cohen's d for Continuous Metrics

For continuous metrics (cost, latency, trajectory quality scores), Cohen's $d$ standardizes the mean difference by the pooled standard deviation:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

where:

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

Conventional benchmarks: $d = 0.2$ (small), $d = 0.5$ (medium), $d = 0.8$ (large). For agent evaluation, these conventions are starting points -- the actual threshold for practical significance depends on the specific application.
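
A corresponding sketch for Cohen's $d$ (the latency figures are made-up illustrations):

```python
import math
import statistics

def cohens_d(x1: list[float], x2: list[float]) -> float:
    """Cohen's d: mean difference standardized by the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = statistics.stdev(x1), statistics.stdev(x2)  # sample SDs (n-1 denominator)
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (statistics.mean(x1) - statistics.mean(x2)) / s_pooled

# Example: per-task latencies (seconds) for two agent versions
print(cohens_d([3.1, 2.8, 3.4, 3.0, 2.9], [2.6, 2.7, 2.5, 2.9, 2.4]))
```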

The Cost-Benefit Framework

Practical significance requires a decision-theoretic framework. Define:

  • $\Delta$ = observed improvement in success rate
  • $C_{\text{eval}}$ = total evaluation cost to confirm the improvement
  • $C_{\text{deploy}}$ = deployment cost (re-testing, migration, risk)
  • $V$ = value of each successful task completion
  • $N$ = expected production volume

The improvement is practically justified if:

$$\Delta \cdot V \cdot N > C_{\text{eval}} + C_{\text{deploy}}$$

For example, a 2% improvement on a task worth \$50, at $N = 10{,}000$ tasks per month, is worth $0.02 \times 50 \times 10{,}000 = \$10{,}000$ per month -- comfortably above $C_{\text{eval}} + C_{\text{deploy}} = \$5{,}000$. The same improvement at $N = 100$ tasks per month is worth only \$100/month -- not worth the deployment cost.
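
A minimal decision-rule sketch under these definitions (the function name, cost split, and volumes are illustrative assumptions):

```python
def deployment_justified(delta: float, value_per_success: float, monthly_volume: int,
                         eval_cost: float, deploy_cost: float) -> bool:
    """Practical-significance check: does the expected gain exceed evaluation + deployment cost?"""
    expected_gain = delta * value_per_success * monthly_volume
    return expected_gain > eval_cost + deploy_cost

# 2% improvement, $50 per successful task, 10,000 tasks/month, $5,000 total cost
print(deployment_justified(0.02, 50, 10_000, 2_000, 3_000))  # True: $10,000/month gain
# The same improvement at only 100 tasks/month
print(deployment_justified(0.02, 50, 100, 2_000, 3_000))     # False: $100/month gain
```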

Minimum Detectable Effect (MDE)

The MDE is the smallest effect size your evaluation can reliably detect given its budget. For a two-proportion z-test:

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \sqrt{\frac{2\,\bar{p}(1-\bar{p})}{n}}$$

For example, with $n = 400$ tasks per arm, $\alpha = 0.05$, power $1-\beta = 0.80$, and a pooled success rate of $\bar{p} = 0.70$:

$$\text{MDE} = (1.96 + 0.84) \sqrt{\frac{2 \times 0.70 \times 0.30}{400}} \approx 0.09$$

Your evaluation can detect ~9 percentage-point differences but is blind to anything smaller. If the expected improvement is 3-5%, this evaluation design is underpowered. The MDE should be computed before running evaluations and compared against the minimum improvement that would be practically significant.
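
A sketch of the same MDE calculation (assuming SciPy for the normal quantiles; the inputs mirror the example above):

```python
import math
from scipy.stats import norm

def mde_two_proportions(n_per_arm: int, p_bar: float,
                        alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable difference in success rate for a two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value (1.96 at alpha = 0.05)
    z_beta = norm.ppf(power)           # quantile for the desired power (0.84 at 80%)
    return (z_alpha + z_beta) * math.sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)

print(round(mde_two_proportions(400, 0.70), 3))  # ~0.091: blind to smaller improvements
```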

Confidence Intervals as Effect Size Communication

A well-constructed confidence interval for the difference in success rates communicates both statistical and practical significance simultaneously:

$$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

Overlay this interval on a region of practical significance defined by a threshold $\delta$, the smallest improvement worth acting on (a sketch of the resulting decision logic follows the list):

  • If the entire CI is above $\delta$: practically and statistically significant.
  • If the CI includes $\delta$ but excludes 0: statistically significant but uncertain practical significance.
  • If the CI includes 0 but the upper bound exceeds $\delta$: inconclusive; need more data.
  • If the entire CI is below $\delta$: the improvement, even if real, is too small to matter.
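
A minimal sketch of that decision logic, using the Wald interval above (the threshold $\delta$ and the task counts are illustrative assumptions):

```python
import math
from scipy.stats import norm

def diff_ci(successes1: int, n1: int, successes2: int, n2: int, alpha: float = 0.05):
    """Wald confidence interval for p1 - p2 (new agent minus baseline)."""
    p1, p2 = successes1 / n1, successes2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = norm.ppf(1 - alpha / 2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

def classify(lo: float, hi: float, delta: float) -> str:
    """Map a CI for the improvement onto the practical-significance regions above."""
    if lo > delta:
        return "practically and statistically significant"
    if hi < delta:
        return "too small to matter, even if real"
    if lo > 0:
        return "statistically significant; practical significance uncertain"
    return "inconclusive; need more data"

lo, hi = diff_ci(310, 400, 290, 400)  # 77.5% vs 72.5% over 400 tasks each
print(classify(lo, hi, delta=0.03))   # CI spans 0 with upper bound above delta: inconclusive
```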

Why It Matters

  1. Prevents wasted deployments: Shipping a statistically significant but trivially small improvement wastes engineering time and introduces deployment risk for negligible gain.
  2. Rescues promising results: A practically meaningful improvement that fails to reach significance due to small samples should trigger more evaluation investment, not rejection.
  3. Enables rational budget allocation: The MDE connects evaluation budget to the smallest improvement worth detecting, preventing both over- and under-investment.
  4. Improves communication: Reporting effect sizes alongside p-values gives stakeholders an intuitive sense of magnitude that p-values alone cannot provide.
  5. Aligns evaluation with business value: The cost-benefit framework connects statistical results directly to organizational decision-making.

Key Technical Details

  • Always report both: p-values and effect sizes serve complementary functions. A significant result with a tiny effect size is noteworthy for different reasons than a significant result with a large effect size.
  • Effect size CI: Report confidence intervals for effect sizes, not just point estimates. For Cohen's $d$, an approximate 95% CI is $d \pm 1.96\sqrt{\frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2)}}$.
  • Equivalence testing (TOST): To actively demonstrate that two agents are equivalent (not just that you failed to detect a difference), use two one-sided tests of $H_{01}: \Delta \le -\delta$ and $H_{02}: \Delta \ge \delta$. Conclude equivalence only if both one-sided tests are significant; see the sketch after this list.
  • Non-inferiority margins: In many settings, the new agent need not be better -- just not worse by more than a margin $\delta_{\text{NI}}$. Non-inferiority testing uses the one-sided null $H_0: \Delta \le -\delta_{\text{NI}}$ and is more appropriate for regression testing than superiority testing.
  • Domain-specific thresholds: A 2% improvement on safety metrics may be critically important; a 2% improvement on a convenience metric may be irrelevant. Practical significance thresholds should be metric-specific.
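
A minimal TOST sketch for two success rates, using a normal approximation (the equivalence margin and counts are illustrative assumptions):

```python
import math
from scipy.stats import norm

def tost_two_proportions(x1: int, n1: int, x2: int, n2: int,
                         delta: float, alpha: float = 0.05):
    """Two one-sided z-tests: is the true difference p1 - p2 inside (-delta, +delta)?"""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    p_lower = norm.sf((diff + delta) / se)   # H01: diff <= -delta
    p_upper = norm.cdf((diff - delta) / se)  # H02: diff >= +delta
    equivalent = max(p_lower, p_upper) < alpha
    return p_lower, p_upper, equivalent

# 71.0% vs 70.0% over 1,000 tasks each, with a 3-point equivalence margin:
# the upper test is not significant, so equivalence is not demonstrated yet.
print(tost_two_proportions(710, 1000, 700, 1000, delta=0.03))
```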

Common Misconceptions

  • "If it's statistically significant, it's important." Statistical significance is a function of sample size. With enough data, any non-zero difference becomes significant. A p-value tells you about evidence against the null, not about the magnitude or importance of the effect.
  • "If it's not significant, there's no effect." Non-significance with low power simply means inconclusive. Calculate the power of your test; if it is below 80% for the effect size of interest, the non-significant result is uninformative.
  • "Cohen's benchmarks (small/medium/large) are universal." Cohen himself called these benchmarks "a last resort" when domain-specific standards are unavailable. In agent evaluation, a "small" effect on a safety metric may be enormous in practical terms, while a "large" effect on a minor convenience metric may be negligible.
  • "The p-value is the probability the result is due to chance." The p-value is , not . This distinction matters when base rates of true effects vary.

Connections to Other Concepts

  • sample-size-and-power-analysis.md -- Power analysis requires specifying the minimum effect size of interest, which should be the MDE aligned with practical significance.
  • confidence-intervals-for-agent-metrics.md -- CIs for the difference between agents communicate both significance and effect size in a single visual.
  • regression-detection-statistics.md -- Regression thresholds should be based on practical significance, not arbitrary statistical conventions.
  • stratified-evaluation-design.md -- Effect sizes may vary dramatically across strata; a practically significant improvement in hard tasks may coexist with no change in easy tasks.
  • ../06-cost-quality-latency-tradeoffs/cost-of-evaluation.md -- The cost-benefit framework ties effect size directly to evaluation economics.

Further Reading

  • "Statistical Power Analysis for the Behavioral Sciences" -- Jacob Cohen, 1988
  • "The Earth Is Round (p < .05)" -- Jacob Cohen, 1994
  • "Moving to a World Beyond 'p < 0.05'" -- Ronald L. Wasserstein, Allen L. Schirm, Nicole A. Lazar, 2019
  • "Testing Statistical Hypotheses of Equivalence and Noninferiority" -- Stefan Wellek, 2010
  • "The New Statistics: Why and How" -- Geoff Cumming, 2014