One-Line Summary: Techniques for ensuring that changes to an agent do not break existing capabilities, including golden test sets, trajectory snapshot testing, statistical regression detection, and CI/CD integration.
Prerequisites: evaluation-with-test-suites.md, integration-testing-skill-chains.md, familiarity with CI/CD concepts
What Is Regression Testing for Agents?
Imagine a hospital that updates its treatment protocols. Before rolling out a new protocol, staff verify that it does not make outcomes worse for patients who were already being treated successfully. Regression testing for agents works the same way: you maintain a set of tasks with known-good results and verify, after every change, that the agent still handles them correctly.
Regression testing is uniquely challenging for AI agents because of non-determinism. A traditional software regression test is binary -- the function returns the right answer or it does not. An agent might produce a slightly different but equally valid answer each time. This means regression testing requires statistical approaches: instead of asking "did this test pass?", you ask "did the pass rate drop by a statistically significant amount?"
The goal is to catch two types of regressions: capability regressions (the agent can no longer do something it used to do) and efficiency regressions (the agent still succeeds but takes more steps, time, or money).
How It Works
Golden Test Sets
A golden test set is a curated collection of tasks with verified-correct outputs -- your highest-confidence regression indicators.
from dataclasses import dataclass
from pathlib import Path
import json

@dataclass
class GoldenTestCase:
    test_id: str
    input_prompt: str
    reference_output: str
    required_facts: list[str]       # facts the output must mention to pass
    max_acceptable_steps: int
    max_acceptable_cost: float
    category: str

class GoldenSetValidator:
    def __init__(self, llm_client):
        self.llm = llm_client

    async def validate(self, case: GoldenTestCase, output: str, steps: int, cost: float) -> dict:
        # Deterministic check: which required facts appear in the output?
        output_lower = output.lower()
        facts_found = [f for f in case.required_facts if f.lower() in output_lower]
        fact_score = len(facts_found) / len(case.required_facts) if case.required_facts else 1.0
        # LLM-judged check: is the output semantically equivalent to the reference?
        equivalence = await self._check_equivalence(case.reference_output, output)
        efficiency_ok = steps <= case.max_acceptable_steps and cost <= case.max_acceptable_cost
        return {"test_id": case.test_id, "fact_score": fact_score,
                "semantic_equivalence": equivalence, "efficiency_ok": efficiency_ok,
                "passed": fact_score >= 0.8 and equivalence >= 0.7 and efficiency_ok}

    async def _check_equivalence(self, reference: str, candidate: str) -> float:
        prompt = f"""Compare for semantic equivalence. Score 0.0-1.0.
Reference: {reference[:1500]}
Candidate: {candidate[:1500]}
Respond with JSON: {{"score": <float>, "reason": "<brief>"}}"""
        result = json.loads((await self.llm.generate(prompt, max_tokens=150)).text)
        return result["score"]
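To exercise the validator against a stored suite, a harness might look like the following minimal sketch. The `agent.run` API and the `golden_tests.json` layout are assumptions for illustration, not part of the validator above; it reuses the imports from the block above.

# Minimal harness sketch. agent.run() and golden_tests.json are hypothetical
# stand-ins for your own agent interface and storage format.
async def run_golden_suite(llm_client, agent) -> list[dict]:
    cases = [GoldenTestCase(**raw)
             for raw in json.loads(Path("golden_tests.json").read_text())]
    validator = GoldenSetValidator(llm_client)
    results = []
    for case in cases:
        output, steps, cost = await agent.run(case.input_prompt)
        results.append(await validator.validate(case, output, steps, cost))
    passed = sum(r["passed"] for r in results)
    print(f"Golden set: {passed}/{len(results)} passed")
    return results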
Snapshot Testing of Trajectories
Instead of just checking final output, snapshot the entire agent trajectory and compare against a known-good reference.
@dataclass
class TrajectorySnapshot:
    test_id: str
    tools_used: list[str]
    total_steps: int
    total_duration_ms: int
    final_output_hash: str

class TrajectoryComparator:
    def compare(self, reference: TrajectorySnapshot, current: TrajectorySnapshot) -> dict:
        ref_tools = set(reference.tools_used)
        cur_tools = set(current.tools_used)
        step_delta = current.total_steps - reference.total_steps
        # Flag only meaningful step growth: more than 2 extra steps or +50%.
        step_regression = step_delta > max(2, reference.total_steps * 0.5)
        duration_ratio = current.total_duration_ms / max(reference.total_duration_ms, 1)
        return {
            "tools_added": list(cur_tools - ref_tools),
            "tools_removed": list(ref_tools - cur_tools),
            "step_delta": step_delta,
            "step_regression": step_regression,
            "duration_ratio": duration_ratio,
            "has_regression": step_regression or duration_ratio > 2.0 or len(ref_tools - cur_tools) > 0,
        }
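Capturing a snapshot per golden test and diffing it against a stored reference might look like this sketch. The hashing choice and the example numbers are illustrative assumptions.

# Sketch: build snapshots from run metadata and compare two builds.
import hashlib

def snapshot_from_run(test_id: str, tool_calls: list[str], steps: int,
                      duration_ms: int, output: str) -> TrajectorySnapshot:
    # Hash the final output so snapshots stay small and diff cleanly in git.
    return TrajectorySnapshot(
        test_id=test_id, tools_used=tool_calls, total_steps=steps,
        total_duration_ms=duration_ms,
        final_output_hash=hashlib.sha256(output.encode()).hexdigest()[:16],
    )

reference = snapshot_from_run("kb-search-01", ["search", "summarize"], 4, 8_000, "v1 answer")
current = snapshot_from_run("kb-search-01", ["search"], 9, 21_000, "v2 answer")
report = TrajectoryComparator().compare(reference, current)
assert report["has_regression"]  # a dropped tool, +5 steps, and 2.6x slower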
Statistical Regression Detection
Non-determinism requires statistical methods to distinguish real regressions from random variation.
import numpy as np
from scipy import stats

class RegressionDetector:
    def __init__(self, significance_level: float = 0.05):
        self.alpha = significance_level

    def detect_completion_rate_regression(
        self, baseline: list[bool], current: list[bool]
    ) -> dict:
        """Two-proportion z-test: has the completion rate significantly decreased?"""
        n1, n2 = len(baseline), len(current)
        p1, p2 = sum(baseline) / n1, sum(current) / n2
        p_pool = (sum(baseline) + sum(current)) / (n1 + n2)
        if p_pool in (0, 1):
            # Both samples all-pass or all-fail: no variance, nothing to test.
            return {"regression_detected": False, "p_value": 1.0}
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se if se > 0 else 0
        # One-sided test: a large positive z means the current rate dropped.
        p_value = float(1 - stats.norm.cdf(z))  # cast keeps the dict JSON-serializable
        return {"regression_detected": p_value < self.alpha, "p_value": p_value,
                "baseline_rate": p1, "current_rate": p2, "delta": p2 - p1}

    def detect_efficiency_regression(
        self, baseline_steps: list[int], current_steps: list[int]
    ) -> dict:
        """Welch's t-test: has the average step count significantly increased?"""
        t_stat, p_two = stats.ttest_ind(current_steps, baseline_steps, equal_var=False)
        p_value = float(p_two) / 2  # one-sided p-value, valid when t_stat > 0
        return {"regression_detected": bool(p_value < self.alpha and t_stat > 0),
                "p_value": p_value,
                "baseline_mean": float(np.mean(baseline_steps)),
                "current_mean": float(np.mean(current_steps))}
CI/CD Integration
Integrate regression testing into your deployment pipeline so regressions are caught before production.
# .github/workflows/agent-regression.yml
name: Agent Regression Tests
on:
  pull_request:
    paths: ['agent/**', 'prompts/**', 'tools/**']
jobs:
  regression-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install -r requirements-test.txt
      - name: Run regression suite (3 trials)
        env: { OPENAI_API_KEY: '${{ secrets.OPENAI_API_KEY }}' }
        run: |
          python scripts/run_regression.py \
            --suite golden_tests.json --trials 3 \
            --baseline results/baseline.json \
            --output results/current.json --significance 0.05
      - name: Check for regressions
        run: |
          python scripts/check_regression.py --baseline results/baseline.json \
            --current results/current.json --fail-on-regression
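The check_regression.py gate can be small: load both result files, run the statistical detector, and exit non-zero so CI fails the build. The following is a sketch of one possible wiring; the results-file schema and the import path are assumptions.

# scripts/check_regression.py -- one possible gate. The results-file schema
# (a JSON list of {"passed": bool} records) and the module name are assumptions.
import argparse, json, sys
from regression_detector import RegressionDetector  # the class defined earlier

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--current", required=True)
    parser.add_argument("--fail-on-regression", action="store_true")
    args = parser.parse_args()

    def load(path: str) -> list[bool]:
        with open(path) as f:
            return [record["passed"] for record in json.load(f)]

    report = RegressionDetector().detect_completion_rate_regression(
        load(args.baseline), load(args.current))
    print(json.dumps(report, indent=2))
    # A non-zero exit code fails the CI job and blocks the merge.
    return 1 if report["regression_detected"] and args.fail_on_regression else 0

if __name__ == "__main__":
    sys.exit(main())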
Why It Matters
Preventing Silent Degradation
Agent systems have many knobs -- prompts, tool configurations, model versions, temperature settings. Changing one knob can silently break a capability that no one tests manually. Without regression testing, these regressions accumulate until users complain. With it, you catch them before deployment.
Confidence to Iterate
Regression tests give developers confidence to make changes. Without them, teams become afraid to update prompts or refactor tool code because they cannot verify existing capabilities still work. A solid regression suite acts as a safety net that enables rapid iteration.
Key Technical Details
- Run the golden test set at least 3 times per evaluation to account for non-determinism
- 30 tasks x 3 trials = 90 data points, sufficient to detect a 15% regression at p < 0.05
- To detect a 10% regression reliably, you need ~200 data points (50 tasks x 4 trials) -- the power check after this list shows the arithmetic
- Store baseline results in version control alongside the test suite
- Update baselines explicitly when performance improves -- never auto-update
- Track both capability metrics (pass rate) and efficiency metrics (steps, cost, latency)
- Budget $3-15 per regression run depending on suite size and model
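A quick way to sanity-check those sample-size figures is a power calculation (a sketch using statsmodels' power utilities; the 85% baseline pass rate is an assumed figure, not taken from the suite above):

# Power check for the sample-size claims above. The 0.85 baseline pass rate
# is an assumption for illustration; ~0.8 power is the conventional target.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

analysis = NormalIndPower()
for drop, n_per_arm in [(0.15, 90), (0.10, 200)]:
    effect = proportion_effectsize(0.85, 0.85 - drop)  # Cohen's h
    power = analysis.solve_power(effect_size=effect, nobs1=n_per_arm,
                                 alpha=0.05, alternative="larger")
    print(f"{drop:.0%} drop, {n_per_arm} trials per arm: power ~ {power:.2f}")
# Both land near 0.8, consistent with the guidance above.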
Common Misconceptions
"Non-determinism makes regression testing impossible": Non-determinism makes individual results unreliable, not the aggregate. With enough trials, statistical tests reliably detect real regressions. A 5% drop measured across 90 trials is statistically distinguishable from random variation.
"You should update the golden set every time the agent improves": Baselines should be updated deliberately, not automatically. Manually verify improvements are real before updating. Automatic updates would silently accept degradations that coincide with other changes.
Connections to Other Concepts
- evaluation-with-test-suites.md -- Evaluation suites provide the tasks; regression testing adds temporal comparison
- unit-testing-individual-skills.md -- Unit tests run first in CI; regression tests run after
- integration-testing-skill-chains.md -- Integration tests catch structural regressions; regression tests catch behavioral ones
Further Reading
- Ribeiro et al., "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList" (2020) -- Regression-aware testing of language models
- Breck et al., "The ML Test Score" (2017) -- Google's framework for ML system testing maturity
- Srivastava et al., "Beyond the Imitation Game" (2022) -- Large-scale benchmark methodology applicable to agent evaluation