Open research on
agent reliability.
We publish our benchmarks, eval methodologies, and infrastructure findings openly. Good science requires reproducibility.
AgentBench v2: A Standardized Benchmark for Production Agent Evaluation
We present AgentBench v2, a standardized eval suite of 47 tasks designed to measure agent reliability in production-like conditions. Unlike capability benchmarks, AgentBench v2 focuses on consistency, regression detection, and tool-call correctness.
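AgentBench v2's actual scoring rules live in its open spec; as a minimal illustration of the kind of consistency metric a reliability benchmark uses (the function name and aggregation are hypothetical, not the benchmark's real definition), one can score how often repeated runs of the same task agree:

```python
from collections import Counter

def consistency_score(run_outcomes: list[bool]) -> float:
    """Fraction of repeated runs agreeing with the majority outcome.

    A capability benchmark asks "can the agent do it once?"; a
    reliability benchmark asks "does it do the same thing every time?"
    """
    if not run_outcomes:
        raise ValueError("need at least one run")
    majority_count = Counter(run_outcomes).most_common(1)[0][1]
    return majority_count / len(run_outcomes)

# An agent that passes 8 of 10 repeated runs of the same task:
print(consistency_score([True] * 8 + [False] * 2))  # 0.8
```

A deterministic agent scores 1.0 regardless of whether it passes or fails; the metric penalizes flakiness, not incapability.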
Sandbox Isolation for Agent Workloads: Performance vs. Security Trade-offs
We characterize the performance overhead of full VM isolation for AI agent eval workloads using Firecracker microVMs. We show that with targeted optimizations, cold start times can be reduced to under 2 seconds while maintaining complete memory and network isolation.
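Sub-2-second cold starts are a claim about wall-clock time from launch request to a ready sandbox. A sketch of the measurement harness such a characterization needs (the `launch_fn` callable and the sandbox handle's methods are hypothetical stand-ins, not the Firecracker or Aegis API):

```python
import statistics
import time

def measure_cold_start(launch_fn, runs: int = 20) -> float:
    """Median wall-clock seconds for launch_fn to produce a ready sandbox.

    Median rather than mean, so one slow outlier (e.g. a cold page
    cache) does not dominate the reported figure.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        sandbox = launch_fn()        # hypothetical: boots a fresh microVM
        sandbox.wait_until_ready()   # hypothetical readiness probe
        samples.append(time.perf_counter() - start)
        sandbox.destroy()            # tear down so each run is a true cold start
    return statistics.median(samples)
```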
Regression Detection in LLM-based Agent Systems
We study the problem of detecting behavioral regressions in language-model-based agents across software releases. We propose a pass-rate delta threshold method and evaluate it on 6 real production agent codebases.
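The paper's evaluation and threshold choice are its own; the core of a pass-rate delta threshold method can be sketched in a few lines (function name and the 0.05 default are illustrative, not the paper's calibrated values):

```python
def detect_regression(baseline_passes: int, baseline_total: int,
                      candidate_passes: int, candidate_total: int,
                      threshold: float = 0.05) -> bool:
    """Flag a regression when the pass rate drops by more than `threshold`.

    Compares the eval pass rate of a candidate release against the
    previous (baseline) release over the same task suite.
    """
    baseline_rate = baseline_passes / baseline_total
    candidate_rate = candidate_passes / candidate_total
    return (baseline_rate - candidate_rate) > threshold

# Pass rate fell from 0.90 to 0.80: a 0.10 drop exceeds the 0.05 threshold.
print(detect_regression(90, 100, 80, 100))  # True
```

Because single-run pass rates on small suites are noisy, a real deployment would run each release multiple times before comparing rates.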
Everything we build,
open by default.
The AgentBench eval suite — 47 tasks, open spec, reproducible scoring.
Python
Official TypeScript + Python SDK for the Aegis eval API.
TypeScript
The open-source Firecracker sandbox runner used inside Aegis.
Rust
Pluggable eval scorer library: exact match, LLM-as-judge, tool-call validators.
Python
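The scorer library's real interface is defined in its repo; a minimal sketch of what a pluggable scorer protocol can look like in Python (all names here are hypothetical, not the library's API):

```python
from typing import Protocol

class Scorer(Protocol):
    """Anything with a score(expected, actual) -> float method plugs in."""
    def score(self, expected: str, actual: str) -> float: ...

class ExactMatchScorer:
    """1.0 on an exact string match, 0.0 otherwise."""
    def score(self, expected: str, actual: str) -> float:
        return 1.0 if expected == actual else 0.0

class ToolCallScorer:
    """Checks that the agent's transcript mentions the expected tool
    (name-level check only; a real validator would parse arguments too)."""
    def __init__(self, expected_tool: str):
        self.expected_tool = expected_tool

    def score(self, expected: str, actual: str) -> float:
        return 1.0 if self.expected_tool in actual else 0.0

def run_scorers(scorers: list[Scorer], expected: str, actual: str) -> float:
    # Aggregate as the unweighted mean of individual scorer outputs.
    return sum(s.score(expected, actual) for s in scorers) / len(scorers)
```

Structural typing (`Protocol`) keeps the library pluggable: an LLM-as-judge scorer needs only the same `score` method, with no required base class.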