Benchmarks, automated evaluation methods, trajectory analysis, and production monitoring for AI agents.