
All Accepted Demos

Arena: Benchmarking AI Agent Frameworks Under Fixed-Model Conditions

Roberto Milev (Navan), Uday Kanagala (Navan)

Evaluation & Benchmarking

Summary

An open-source benchmarking tool that evaluates agent frameworks under fixed-model conditions, finding that scenario-specific orchestration adds no measurable benefit over generic agentic loops.

Description

Existing agent benchmarks evaluate models, not the frameworks that orchestrate them, making it impossible to isolate how much performance comes from the model and how much from the framework's orchestration code. We present Arena, an open-source benchmarking tool that evaluates agent frameworks under fixed-model conditions. Arena fixes six frameworks — Claude Agent SDK, LangChain, LangGraph, AWS Strands, CrewAI, and Google ADK — to Claude Sonnet 4.5 on AWS Bedrock, connects them to the same MCP tool server, and scores them with a deterministic evaluator across three scenarios of increasing complexity using six metrics: code complexity, step efficiency, latency, correctness, consistency, and cost. We ask: does explicitly programming agent flows provide measurable benefit over a generic agentic loop driven by prompts? Our evaluation reveals that on simple tasks all frameworks perform comparably, but as complexity grows, traditional frameworks require 2–4× more scenario-specific orchestration code yet gain no correctness advantage. The Claude Agent SDK uses the same generic agentic loop across all scenarios; only the prompt changes. We contribute (1) a fixed-model methodology isolating framework behavior from model capability, (2) an extensible open-source tool for practitioner evaluation, and (3) empirical evidence that scenario-specific orchestration adds no measurable benefit over generic agentic loops driven by prompts.
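The per-framework scoring described above can be sketched as follows. This is a minimal illustration of how a deterministic evaluator might aggregate repeated runs of one framework on one scenario; `RunResult`, `score`, and the consistency definition (fraction of runs agreeing with the majority correctness outcome) are illustrative assumptions, not Arena's actual API.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    """One run of a framework on a scenario (hypothetical record shape)."""
    steps: int        # tool-call steps taken
    latency_s: float  # wall-clock latency in seconds
    correct: bool     # did the deterministic check pass?
    cost_usd: float   # model API cost for the run

def score(results: list[RunResult]) -> dict:
    """Aggregate deterministic metrics over repeated runs.

    Consistency here is the fraction of runs whose correctness matches
    the majority outcome across all runs -- one plausible definition.
    """
    n = len(results)
    n_correct = sum(r.correct for r in results)
    majority = n_correct * 2 >= n
    consistency = sum(r.correct == majority for r in results) / n
    return {
        "correctness": n_correct / n,
        "consistency": consistency,
        "avg_steps": sum(r.steps for r in results) / n,
        "avg_latency_s": sum(r.latency_s for r in results) / n,
        "total_cost_usd": sum(r.cost_usd for r in results),
    }

# Three hypothetical runs of one framework on one scenario.
runs = [
    RunResult(steps=6, latency_s=4.2, correct=True, cost_usd=0.03),
    RunResult(steps=7, latency_s=4.8, correct=True, cost_usd=0.04),
    RunResult(steps=9, latency_s=6.1, correct=False, cost_usd=0.05),
]
print(score(runs))
```

Because the model, tools, and evaluator are held fixed, differences in these scores between frameworks can be attributed to orchestration rather than model capability.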

ACM CAIS 2026 Sponsors