One-Line Summary: Statistical significance tells you whether a difference is likely real; effect size and practical significance tell you whether it matters -- a distinction that prevents both wasted deployments and missed opportunities.
Prerequisites: sample-size-and-power-analysis.md, confidence-intervals-for-agent-metrics.md
What Is Effect Size and Practical Significance?
Imagine a pharmaceutical trial finds that a new drug lowers blood pressure by 0.5 mmHg with $p < 0.001$. The effect is "statistically significant" but clinically meaningless -- no doctor would prescribe it. Conversely, a trial finding a 15 mmHg reduction with $p = 0.15$ (not significant) should not be dismissed; the sample was probably just too small. This distinction between statistical and practical significance is one of the most important -- and most frequently ignored -- concepts in agent evaluation.
In the agent evaluation context, practical significance asks: "Given the costs of deploying this new agent version (re-testing, migration, risk), does the observed improvement justify the effort?" A 2% improvement in benchmark accuracy that is statistically significant ($p < 0.05$) may not warrant the operational overhead of a deployment. Meanwhile, a 10% improvement that fails to reach significance ($p > 0.05$) because of limited evaluation budget may represent a genuinely important advance that deserves further investigation.
Effect size measures quantify the magnitude of a difference on a standardized scale, independent of sample size. They answer "how big is the difference?" rather than "is there a difference?" -- and for decision-making, magnitude almost always matters more than mere existence.
How It Works
Cohen's h for Proportion Differences
When comparing binary success rates -- the most common agent evaluation setting -- Cohen's $h$ measures effect size on the arcsine-transformed scale:

$$h = \left| 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2} \right|$$
This transformation stabilizes variance across the range of proportions. Conventional benchmarks:
| $h$  | Interpretation | Example    |
|------|----------------|------------|
| 0.20 | Small effect   | 70% vs 79% |
| 0.50 | Medium effect  | 70% vs 89% |
| 0.80 | Large effect   | 70% vs 97% |
For example, comparing an agent at $p_1 = 0.70$ against one at $p_2 = 0.75$:

$$h = \left| 2\arcsin\sqrt{0.75} - 2\arcsin\sqrt{0.70} \right| \approx |2.094 - 1.982| = 0.112$$
This is a small effect -- detectable only with large samples and unlikely to be practically important in most contexts.
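As a sketch, Cohen's $h$ needs nothing beyond the standard library (the function name here is illustrative, not from any particular package):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions,
    computed on the arcsine-transformed (variance-stabilized) scale."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2)

# Comparing success rates of 70% vs 75%:
print(f"h = {cohens_h(0.70, 0.75):.3f}")  # a small effect (h < 0.20)
```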
Cohen's d for Continuous Metrics
For continuous metrics (cost, latency, trajectory quality scores), Cohen's $d$ standardizes the mean difference by the pooled standard deviation:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

where:

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
Conventional benchmarks: $d = 0.2$ (small), $d = 0.5$ (medium), $d = 0.8$ (large). For agent evaluation, these conventions are starting points -- the actual threshold for practical significance depends on the specific application.
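A minimal implementation of Cohen's $d$ from raw samples (the latency numbers below are invented purely for illustration):

```python
import math

def cohens_d(xs: list, ys: list) -> float:
    """Cohen's d: mean difference standardized by the pooled standard deviation."""
    n1, n2 = len(xs), len(ys)
    m1 = sum(xs) / n1
    m2 = sum(ys) / n2
    v1 = sum((x - m1) ** 2 for x in xs) / (n1 - 1)  # sample variance, group 1
    v2 = sum((y - m2) ** 2 for y in ys) / (n2 - 1)  # sample variance, group 2
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

# Hypothetical per-task latencies (seconds) for two agent versions.
old = [3.1, 2.8, 3.5, 3.0, 2.9, 3.3]
new = [2.6, 2.4, 2.9, 2.5, 2.7, 2.8]
print(f"d = {cohens_d(old, new):.2f}")  # well above 0.8: a large effect
```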
The Cost-Benefit Framework
Practical significance requires a decision-theoretic framework. Define:
- $\Delta$ = observed improvement in success rate
- $C_{\text{eval}}$ = total evaluation cost to confirm the improvement
- $C_{\text{deploy}}$ = deployment cost (re-testing, migration, risk)
- $V$ = value of each successful task completion
- $N$ = expected production volume (e.g., tasks per month)
The improvement is practically justified if:

$$\Delta \cdot V \cdot N > C_{\text{eval}} + C_{\text{deploy}}$$
For example, a 2% improvement on a task worth \$50, at $N = 10{,}000$ tasks/month, is worth $0.02 \times 50 \times 10{,}000 = \$10{,}000$ per month -- easily clearing $C_{\text{eval}} + C_{\text{deploy}} = \$5{,}000$. The same 2% improvement on a task worth \$0.50 generates only \$100/month -- not worth the deployment cost.
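The decision rule above is simple enough to encode directly; the function and parameter names here are illustrative, not from any particular library:

```python
def deployment_justified(delta: float, value_per_success: float,
                         volume: float, c_eval: float, c_deploy: float) -> bool:
    """Practical-significance check: expected gain over the decision
    horizon must exceed evaluation plus deployment costs."""
    return delta * value_per_success * volume > c_eval + c_deploy

# 2% improvement, $50/task, 10,000 tasks/month, $5,000 total cost:
print(deployment_justified(0.02, 50.0, 10_000, 2_000, 3_000))  # True: $10,000 > $5,000
# Same improvement on a $0.50 task:
print(deployment_justified(0.02, 0.50, 10_000, 2_000, 3_000))  # False: $100 < $5,000
```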
Minimum Detectable Effect (MDE)
The MDE is the smallest effect size your evaluation can reliably detect given its budget. For a two-proportion z-test with $n$ tasks per group and average success rate $\bar{p}$:

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta})\sqrt{\frac{2\bar{p}(1-\bar{p})}{n}}$$
For $n = 400$ per group, $\bar{p} = 0.70$, $\alpha = 0.05$, and power $1 - \beta = 0.80$:

$$\text{MDE} = (1.96 + 0.84)\sqrt{\frac{2 \times 0.70 \times 0.30}{400}} \approx 0.091$$
Your evaluation can detect ~9 percentage-point differences but is blind to anything smaller. If the expected improvement is 3-5%, this evaluation design is underpowered. The MDE should be computed before running evaluations and compared against the minimum improvement that would be practically significant.
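A sketch of the MDE calculation, assuming equal group sizes and the pooled-variance normal approximation used above:

```python
from statistics import NormalDist

def mde_two_proportions(n_per_arm: int, p_bar: float,
                        alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable effect for a two-sided two-proportion z-test
    with equal arms, under the pooled-variance normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g., 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g., 0.84 for 80% power
    return (z_alpha + z_beta) * (2 * p_bar * (1 - p_bar) / n_per_arm) ** 0.5

# The worked example from the text: 400 tasks per arm around a 70% baseline.
print(f"MDE = {mde_two_proportions(400, 0.70):.3f}")  # ~9 percentage points
```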
Confidence Intervals as Effect Size Communication
A well-constructed confidence interval for the difference communicates both statistical and practical significance simultaneously:

$$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
Overlay this interval on a region of practical significance, defined by the smallest improvement that matters, $\delta$:

- If the entire CI is above $\delta$: practically and statistically significant.
- If the CI includes $\delta$ but excludes 0: statistically significant but uncertain practical significance.
- If the CI includes 0 but the upper bound exceeds $\delta$: inconclusive; need more data.
- If the entire CI is below $\delta$: the improvement, even if real, is too small to matter.
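The four cases above can be expressed as a small decision helper (a sketch; the returned labels are illustrative wording, not standard terminology):

```python
def interpret_ci(lo: float, hi: float, delta: float) -> str:
    """Classify a confidence interval [lo, hi] for the success-rate
    difference against a practical-significance threshold delta > 0."""
    if lo > delta:
        return "practically and statistically significant"
    if hi < delta:
        return "too small to matter, even if real"
    if lo > 0:
        return "statistically significant; practical significance uncertain"
    return "inconclusive; collect more data"

print(interpret_ci(0.04, 0.09, delta=0.03))   # entire CI above delta
print(interpret_ci(0.01, 0.05, delta=0.03))   # excludes 0, includes delta
print(interpret_ci(-0.01, 0.06, delta=0.03))  # includes 0, upper bound past delta
print(interpret_ci(-0.02, 0.02, delta=0.03))  # entire CI below delta
```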
Why It Matters
- Prevents wasted deployments: Shipping a statistically significant but trivially small improvement wastes engineering time and introduces deployment risk for negligible gain.
- Rescues promising results: A practically meaningful improvement that fails to reach significance due to small samples should trigger more evaluation investment, not rejection.
- Enables rational budget allocation: The MDE connects evaluation budget to the smallest improvement worth detecting, preventing both over- and under-investment.
- Improves communication: Reporting effect sizes alongside p-values gives stakeholders an intuitive sense of magnitude that p-values alone cannot provide.
- Aligns evaluation with business value: The cost-benefit framework connects statistical results directly to organizational decision-making.
Key Technical Details
- Always report both: p-values and effect sizes serve complementary functions. A significant result with a tiny effect size is noteworthy for different reasons than a significant result with a large effect size.
- Effect size CI: Report confidence intervals for effect sizes, not just point estimates. For Cohen's $h$, an approximate 95% CI is $h \pm 1.96\sqrt{1/n_1 + 1/n_2}$.
- Equivalence testing (TOST): To actively demonstrate that two agents are equivalent (not just that you failed to detect a difference), use two one-sided tests against equivalence bounds $\pm\delta$: test $H_{01}: \Delta \le -\delta$ and $H_{02}: \Delta \ge \delta$. Conclude equivalence only if both one-sided tests reject.
- Non-inferiority margins: In many settings, the new agent need not be better -- just not worse by more than a margin $\delta$. Non-inferiority testing uses $H_0: \Delta \le -\delta$ versus $H_1: \Delta > -\delta$ and is more appropriate for regression testing than superiority testing.
- Domain-specific thresholds: A 2% improvement on safety metrics may be critically important; a 2% improvement on a convenience metric may be irrelevant. Practical significance thresholds should be metric-specific.
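The TOST procedure mentioned above can be sketched for two success rates using the normal approximation (the function name and the example margin are hypothetical):

```python
from statistics import NormalDist

def tost_two_proportions(p1: float, n1: int, p2: float, n2: int,
                         delta: float, alpha: float = 0.05) -> bool:
    """Two one-sided z-tests (TOST) for equivalence of two success rates:
    conclude equivalence only if the difference is significantly above
    -delta AND significantly below +delta."""
    diff = p1 - p2
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha)
    z_lower = (diff + delta) / se  # tests H0: diff <= -delta
    z_upper = (diff - delta) / se  # tests H0: diff >= +delta
    return z_lower > z_crit and z_upper < -z_crit

# 71% vs 70% on 1,000 tasks each, with a 5-point equivalence margin:
print(tost_two_proportions(0.71, 1000, 0.70, 1000, delta=0.05))  # True
# Same rates on only 100 tasks each: too little data to show equivalence.
print(tost_two_proportions(0.71, 100, 0.70, 100, delta=0.05))  # False
```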
Common Misconceptions
- "If it's statistically significant, it's important." Statistical significance is a function of sample size. With enough data, any non-zero difference becomes significant. A p-value tells you about evidence against the null, not about the magnitude or importance of the effect.
- "If it's not significant, there's no effect." Non-significance with low power simply means inconclusive. Calculate the power of your test; if it is below 80% for the effect size of interest, the non-significant result is uninformative.
- "Cohen's benchmarks (small/medium/large) are universal." Cohen himself called these benchmarks "a last resort" when domain-specific standards are unavailable. In agent evaluation, a "small" effect on a safety metric may be enormous in practical terms, while a "large" effect on a minor convenience metric may be negligible.
- "The p-value is the probability the result is due to chance." The p-value is , not . This distinction matters when base rates of true effects vary.
Connections to Other Concepts
- sample-size-and-power-analysis.md -- Power analysis requires specifying the minimum effect size of interest, which should be the MDE aligned with practical significance.
- confidence-intervals-for-agent-metrics.md -- CIs for the difference between agents communicate both significance and effect size in a single visual.
- regression-detection-statistics.md -- Regression thresholds should be based on practical significance, not arbitrary statistical conventions.
- stratified-evaluation-design.md -- Effect sizes may vary dramatically across strata; a practically significant improvement in hard tasks may coexist with no change in easy tasks.
- ../06-cost-quality-latency-tradeoffs/cost-of-evaluation.md -- The cost-benefit framework ties effect size directly to evaluation economics.
Further Reading
- "Statistical Power Analysis for the Behavioral Sciences" -- Jacob Cohen, 1988
- "The Earth Is Round (p < .05)" -- Jacob Cohen, 1994
- "Moving to a World Beyond 'p < 0.05'" -- Ronald L. Wasserstein, Allen L. Schirm, Nicole A. Lazar, 2019
- "Testing Statistical Hypotheses of Equivalence and Noninferiority" -- Stefan Wellek, 2010
- "The New Statistics: Why and How" -- Geoff Cumming, 2014