One-Line Summary: Regression detection uses hypothesis testing and sequential analysis to distinguish genuine performance drops from natural variance, balancing fast detection against false alarms.
Prerequisites: sample-size-and-power-analysis.md, confidence-intervals-for-agent-metrics.md, effect-size-and-practical-significance.md
What Is Regression Detection?
Imagine a factory quality inspector who checks products coming off an assembly line. They need to sound the alarm when quality drops, but not every defective item means the machine is broken -- some defects are normal. The challenge is detecting real problems quickly while avoiding false alarms that halt production unnecessarily. Agent regression detection is exactly this problem applied to AI systems.
When you update an agent -- changing the underlying model, modifying prompts, adjusting tool definitions, or altering the orchestration logic -- you need to know whether the new version is worse than the old one. A naive approach compares success rates directly ("old: 74%, new: 71% -- regression!"), but natural variance means a 3-point drop can easily occur by chance alone. Regression detection statistics formalize the decision process, controlling both the probability of missing a real regression (Type II error) and the probability of flagging a false one (Type I error).
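To see the scale of that variance, consider a hypothetical 200 trials per version. The standard error of the difference in success rates is

$$\text{SE} = \sqrt{\frac{0.74 \times 0.26}{200} + \frac{0.71 \times 0.29}{200}} \approx 0.045$$

so the observed 3-point drop is less than one standard error from zero -- entirely consistent with no change at all.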
The practical challenge is speed. In a fast-iteration development cycle, you want to detect regressions with as few evaluation runs as possible. Classical fixed-sample tests require committing to a sample size upfront. Sequential methods -- the workhorse of modern regression detection -- let you evaluate evidence as data arrives, often reaching a decision 30-50% faster.
How It Works
Chi-Squared Test for Success Rate Comparison
The most basic approach compares success rates between agent versions using a chi-squared test. Given $n_{\text{old}}$ trials of the old agent with $s_{\text{old}}$ successes and $n_{\text{new}}$ trials of the new agent with $s_{\text{new}}$ successes, form the 2x2 contingency table of successes and failures and compute:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Reject the null hypothesis of equal rates if $\chi^2 > \chi^2_{1,\,1-\alpha}$, the $1-\alpha$ quantile of the chi-squared distribution with one degree of freedom. For a one-sided test (detecting regression specifically), use the two-proportion statistic

$$z = \frac{\hat{p}_{\text{new}} - \hat{p}_{\text{old}}}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_{\text{old}}} + \frac{1}{n_{\text{new}}}\right)}}$$

where $\hat{p}$ is the pooled success rate, and compare against $-z_{1-\alpha}$.

For small samples, Fisher's exact test is preferable: conditioning on the table margins, the null probability of any table is hypergeometric,

$$P = \frac{\binom{n_{\text{old}}}{s_{\text{old}}} \binom{n_{\text{new}}}{s_{\text{new}}}}{\binom{n_{\text{old}} + n_{\text{new}}}{s_{\text{old}} + s_{\text{new}}}}$$

and the p-value sums the probabilities of all tables at least as extreme as the observed one.
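A minimal sketch of both tests using SciPy; the trial counts are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# 2x2 contingency table: rows = versions, columns = (successes, failures)
n_old, s_old = 500, 370   # old agent: 74% success (hypothetical counts)
n_new, s_new = 500, 355   # new agent: 71% success

table = np.array([[s_old, n_old - s_old],
                  [s_new, n_new - s_new]])

chi2, p_two_sided, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-squared = {chi2:.3f}, two-sided p = {p_two_sided:.3f}")

# One-sided test for a regression (new rate < old rate). With
# alternative='less', fisher_exact tests whether the first row's success
# odds are lower -- so order the table with the NEW version first.
_, p_one_sided = fisher_exact([[s_new, n_new - s_new],
                               [s_old, n_old - s_old]], alternative="less")
print(f"one-sided Fisher p (regression) = {p_one_sided:.3f}")
```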
Sequential Probability Ratio Test (SPRT)
SPRT evaluates evidence continuously as each data point arrives. Define the log-likelihood ratio after $n$ observations:

$$\Lambda_n = \sum_{i=1}^{n} \log \frac{P(x_i \mid p_0)}{P(x_i \mid p_1)}$$

where $p_0$ is the baseline success rate and $p_1 < p_0$ is the regression threshold, so $\Lambda_n$ accumulates evidence for the baseline relative to the regressed rate. For binary outcomes:

$$\Lambda_n = s_n \log\frac{p_0}{p_1} + (n - s_n) \log\frac{1 - p_0}{1 - p_1}$$

where $s_n$ is the cumulative number of successes. The decision boundaries are:
- Declare regression if $\Lambda_n \le b$ (lower boundary $b = \log\frac{\alpha}{1 - \beta}$)
- Declare no regression if $\Lambda_n \ge a$ (upper boundary $a = \log\frac{1 - \alpha}{\beta}$)
- Continue testing if $b < \Lambda_n < a$
For $\alpha = 0.05$ and $\beta = 0.05$: $a \approx 2.94$ and $b \approx -2.94$.
SPRT reaches decisions in 30-50% fewer samples than fixed-sample tests on average, making it ideal for expensive agent evaluations.
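A minimal sketch of the binary-outcome SPRT with the boundary orientation above; the default `p0`, `p1`, `alpha`, and `beta` values are illustrative, not prescribed:

```python
import math

def sprt(outcomes, p0=0.74, p1=0.69, alpha=0.05, beta=0.05):
    """Return (decision, n) where decision is 'regression',
    'no_regression', or 'continue'."""
    a = math.log((1 - alpha) / beta)   # upper boundary: accept no-regression
    b = math.log(alpha / (1 - beta))   # lower boundary: declare regression
    llr, n = 0.0, 0
    for x in outcomes:
        n += 1
        # Evidence for the baseline p0 relative to the regressed rate p1:
        # a success pushes the statistic up, a failure pushes it down.
        llr += math.log(p0 / p1) if x else math.log((1 - p0) / (1 - p1))
        if llr <= b:
            return "regression", n
        if llr >= a:
            return "no_regression", n
    return "continue", n
```

With these defaults the boundaries come out to roughly $\pm 2.94$, matching the worked values above; a streak of failures drives the statistic to the lower boundary long before a fixed-sample test would conclude.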
CUSUM (Cumulative Sum) for Monitoring
For ongoing monitoring of deployed agents, CUSUM detects downward shifts in mean performance over time. Define the cumulative sum:

$$S_0 = 0, \qquad S_t = \max\left(0,\; S_{t-1} + (\mu_0 - x_t) - \frac{\delta}{2}\right)$$

where $\mu_0$ is the target performance level and $\delta$ is the minimum shift to detect. Signal an alarm when $S_t > h$, where $h$ is calibrated to control the average run length (ARL) under no-change conditions.

The ARL under the null (no regression) should be set based on your tolerance for false alarms. For daily monitoring where roughly one false alarm per week is acceptable, target $\text{ARL}_0 \approx 7$ observations and choose $h$ (typically by simulation) to achieve it.
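A minimal sketch, where $x_t$ can be a single run outcome or a daily batch mean; `mu0`, `delta`, `h`, and the simulation parameters are all illustrative. The second function estimates $\text{ARL}_0$ by simulation, which is the usual way to calibrate $h$:

```python
import random

def cusum_first_alarm(xs, mu0, delta, h):
    """Return the first alarm time, or None if no alarm fires."""
    s = 0.0
    for t, x in enumerate(xs, start=1):
        # Accumulate evidence that performance sits below mu0 - delta/2.
        s = max(0.0, s + (mu0 - x) - delta / 2)
        if s > h:
            return t
    return None

def simulate_arl0(mu0=0.74, delta=0.05, h=8.0, horizon=10_000, reps=200):
    """Estimate the average run length under no change, to calibrate h."""
    rng = random.Random(0)
    runs = []
    for _ in range(reps):
        xs = (1.0 if rng.random() < mu0 else 0.0 for _ in range(horizon))
        t = cusum_first_alarm(xs, mu0, delta, h)
        runs.append(t if t is not None else horizon)
    return sum(runs) / len(runs)
```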
Multiple Comparison Correction
When monitoring $m$ metrics simultaneously (success rate, cost, latency, safety violations), the family-wise error rate inflates. Corrections include:
- Bonferroni: Test each metric at $\alpha / m$. Simple but conservative.
- Holm-Bonferroni: Sequentially rejective, uniformly more powerful than Bonferroni.
- Benjamini-Hochberg (FDR): Controls the false discovery rate rather than the family-wise error rate. Sort the p-values $p_{(1)} \le \cdots \le p_{(m)}$ and reject all hypotheses with $p_{(i)} \le \frac{i}{m}\, q$. For 10+ metrics, this is substantially more powerful.
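A minimal sketch of the Benjamini-Hochberg step-up procedure; the p-values in the example are hypothetical:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * q, then reject
    # everything up to and including that rank.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# Example: 10 metrics monitored simultaneously.
pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.20, 0.35, 0.51, 0.74, 0.98]
print(benjamini_hochberg(pvals, q=0.05))  # -> [0, 1]
```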
Choosing Significance Thresholds
The conventional $\alpha = 0.05$ is not universally appropriate:
| Context | Recommended $\alpha$ | Rationale |
|---|---|---|
| Daily development iteration | 0.10 | Fast detection; easy to revert |
| Pre-release validation | 0.05 | Standard balance |
| Safety-critical deployment | 0.01 | Errors are costly; pair the strict $\alpha$ with a large $n$ so missed regressions stay rare |
| Production monitoring | Calibrate via ARL | Control false alarm frequency |
Why It Matters
- Prevents shipping regressions: Without statistical tests, teams rely on intuition ("it looks about the same"), which systematically fails for small-to-medium regressions.
- Enables fast iteration: Sequential methods let teams make confident ship/no-ship decisions in hours rather than days, directly accelerating development velocity.
- Controls false alarms: A team that cries wolf too often -- flagging noise as regressions -- loses trust in the evaluation system. Formal error control prevents this.
- Supports automated pipelines: CI/CD pipelines for agents need programmatic go/no-go decisions. Statistical tests provide the formal decision rule.
Key Technical Details
- One-sided vs two-sided tests: For regression detection, use one-sided tests ($H_1: p_{\text{new}} < p_{\text{old}}$). Two-sided tests waste power on detecting improvements, which is not the goal during regression testing.
- SPRT maximum sample size: Truncated SPRT adds a maximum sample size $n_{\max}$ to guarantee termination. Set $n_{\max}$ to 2-3x the fixed-sample equivalent.
- Paired vs unpaired: If both agent versions are evaluated on the same tasks, use McNemar's test for binary outcomes: $\chi^2 = \frac{(b - c)^2}{b + c}$, where $b$ and $c$ are the counts of discordant pairs (tasks one version solved and the other failed). This substantially increases power; see the sketch after this list.
- Effect of non-stationarity: Agent performance can drift over time due to external API changes. Detrend data before applying change-point detection.
- Combining SPRT with CUSUM: Use SPRT for version-to-version comparisons and CUSUM for continuous monitoring. They complement each other.
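A minimal sketch of the paired test, using the exact binomial form of McNemar's test (the chi-squared formula above is its large-sample approximation); the discordant-pair counts are hypothetical:

```python
from scipy.stats import binomtest

def mcnemar_one_sided(b, c):
    """b = tasks the old version solved but the new failed;
    c = tasks the new version solved but the old failed.

    Exact one-sided test that the new version is worse: under the null,
    each discordant pair is equally likely to go either way, so b
    follows Binomial(b + c, 0.5).
    """
    return binomtest(b, n=b + c, p=0.5, alternative="greater").pvalue

# Hypothetical paired evaluation: 14 tasks regressed, 5 improved.
print(f"one-sided p = {mcnemar_one_sided(14, 5):.4f}")
```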
Common Misconceptions
- "If the p-value is 0.06, we are safe to ship." A p-value of 0.06 is weak evidence against the null, not evidence for it. With low power (small ), a true 5% regression might easily produce . Always report power alongside the p-value.
- "We tested 15 metrics and found one regression at p < 0.05, so something is wrong." With 15 independent tests at , you expect false positives. A single significant result is entirely consistent with no regression. Apply multiple comparison correction.
- "Sequential testing lets you peek at the data as much as you want." SPRT controls error rates for the specific sequential decision boundaries. If you compute a p-value at each step without adjusting, your actual Type I error rate can be 2-4x the nominal level. Use the correct sequential boundaries.
- "A/B testing methods from web experiments transfer directly to agent evaluation." Web A/B tests assume IID observations. Agent evaluation runs on the same task are correlated, and tasks vary enormously in difficulty. Account for this structure or risk invalid conclusions.
Connections to Other Concepts
- sample-size-and-power-analysis.md -- Power analysis determines the sample size for fixed-sample regression tests and the expected stopping time for sequential tests.
- effect-size-and-practical-significance.md -- The minimum detectable effect should reflect practical significance, not convenience.
- confidence-intervals-for-agent-metrics.md -- CIs for the difference between versions provide a complementary view to hypothesis tests.
- variance-decomposition.md -- High environment variance inflates the noise floor, requiring larger samples or more stringent variance control.
- ../09-production-evaluation-and-monitoring/online-monitoring.md -- CUSUM and SPRT are the statistical backbone of production monitoring systems.
Further Reading
- "Sequential Analysis" -- Abraham Wald, 1947
- "Sequential Tests of Statistical Hypotheses" -- Abraham Wald, 1945
- "Page's Cumulative Sum Charts for Monitoring" -- E.S. Page, 1954
- "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing" -- Yoav Benjamini, Yosef Hochberg, 1995
- "Always Valid Inference: Continuous Monitoring of A/B Tests" -- Ramesh Johari, Pete Koomen, Leonid Pekelis, David Walsh, 2017