Open research on
agent reliability.
We publish our benchmarks, eval methodologies, and infrastructure findings openly. Good science requires reproducibility.
AgentBench v2: A Standardized Benchmark for Production Agent Evaluation
We present AgentBench v2, a standardized eval suite of 47 tasks designed to measure agent reliability in production-like conditions. Unlike capability benchmarks, AgentBench v2 focuses on consistency, regression detection, and tool-call correctness.
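AgentBench v2's actual scoring rules live in its open spec; as a minimal illustration of the kind of consistency metric a reliability benchmark uses (the function name and aggregation are hypothetical, not the benchmark's real definition), one can score how often repeated runs of the same task agree:

```python
from collections import Counter

def consistency_score(run_outcomes: list[bool]) -> float:
    """Fraction of repeated runs agreeing with the majority outcome.

    A capability benchmark asks "can the agent do it once?"; a
    reliability benchmark asks "does it do the same thing every time?"
    """
    if not run_outcomes:
        raise ValueError("need at least one run")
    majority_count = Counter(run_outcomes).most_common(1)[0][1]
    return majority_count / len(run_outcomes)

# An agent that passes 8 of 10 repeated runs of the same task:
print(consistency_score([True] * 8 + [False] * 2))  # 0.8
```

A deterministic agent scores 1.0 regardless of whether it passes or fails; the metric penalizes flakiness, not incapability.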
Sandbox Isolation for Agent Workloads: Performance vs. Security Trade-offs
We characterize the performance overhead of full VM isolation for AI agent eval workloads using Firecracker microVMs. We show that with targeted optimizations, cold start times can be reduced to under 2 seconds while maintaining complete memory and network isolation.
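Sub-2-second cold starts are a claim about wall-clock time from launch request to a ready sandbox. A sketch of the measurement harness such a characterization needs (the `launch_fn` callable and the sandbox handle's methods are hypothetical stand-ins, not the Firecracker or Aegis API):

```python
import statistics
import time

def measure_cold_start(launch_fn, runs: int = 20) -> float:
    """Median wall-clock seconds for launch_fn to produce a ready sandbox.

    Median rather than mean, so one slow outlier (e.g. a cold page
    cache) does not dominate the reported figure.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        sandbox = launch_fn()        # hypothetical: boots a fresh microVM
        sandbox.wait_until_ready()   # hypothetical readiness probe
        samples.append(time.perf_counter() - start)
        sandbox.destroy()            # tear down so each run is a true cold start
    return statistics.median(samples)
```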
Regression Detection in LLM-based Agent Systems
We study the problem of detecting behavioral regressions in language-model-based agents across software releases. We propose a pass-rate delta threshold method and evaluate it on 6 real production agent codebases.
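The paper's evaluation and threshold choice are its own; the core of a pass-rate delta threshold method can be sketched in a few lines (function name and the 0.05 default are illustrative, not the paper's calibrated values):

```python
def detect_regression(baseline_passes: int, baseline_total: int,
                      candidate_passes: int, candidate_total: int,
                      threshold: float = 0.05) -> bool:
    """Flag a regression when the pass rate drops by more than `threshold`.

    Compares the eval pass rate of a candidate release against the
    previous (baseline) release over the same task suite.
    """
    baseline_rate = baseline_passes / baseline_total
    candidate_rate = candidate_passes / candidate_total
    return (baseline_rate - candidate_rate) > threshold

# Pass rate fell from 0.90 to 0.80: a 0.10 drop exceeds the 0.05 threshold.
print(detect_regression(90, 100, 80, 100))  # True
```

Because single-run pass rates on small suites are noisy, a real deployment would run each release multiple times before comparing rates.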
Everything we build,
open by default.
The AgentBench eval suite — 47 tasks, open spec, reproducible scoring.
Python
Official TypeScript + Python SDK for the Aegis eval API.
TypeScript
The open-source Firecracker sandbox runner used inside Aegis.
Rust
Pluggable eval scorer library: exact match, LLM-as-judge, tool-call validators.
Python
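The scorer library's real interface is defined in its repo; a minimal sketch of what a pluggable scorer protocol can look like in Python (all names here are hypothetical, not the library's API):

```python
from typing import Protocol

class Scorer(Protocol):
    """Anything with a score(expected, actual) -> float method plugs in."""
    def score(self, expected: str, actual: str) -> float: ...

class ExactMatchScorer:
    """1.0 on an exact string match, 0.0 otherwise."""
    def score(self, expected: str, actual: str) -> float:
        return 1.0 if expected == actual else 0.0

class ToolCallScorer:
    """Checks that the agent's transcript mentions the expected tool
    (name-level check only; a real validator would parse arguments too)."""
    def __init__(self, expected_tool: str):
        self.expected_tool = expected_tool

    def score(self, expected: str, actual: str) -> float:
        return 1.0 if self.expected_tool in actual else 0.0

def run_scorers(scorers: list[Scorer], expected: str, actual: str) -> float:
    # Aggregate as the unweighted mean of individual scorer outputs.
    return sum(s.score(expected, actual) for s in scorers) / len(scorers)
```

Structural typing (`Protocol`) keeps the library pluggable: an LLM-as-judge scorer needs only the same `score` method, with no required base class.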