

Peeking Under the Hood of Multi-Agent Systems

Tie Ma (Beihang University), Yixi Chen (KAUST), Vaastav Anand (MPI-SWS), Alessandro Cornacchia (KAUST), Amândio R. Faustino (KAUST), Guanheng Liu (Beihang University), Shan Zhang (Beihang University), Hongbin Luo (Beihang University), Suhaib A. Fahmy (KAUST), Zafar A. Qazi (LUMS and KAUST), Marco Canini (KAUST)

Evaluation & Benchmarking; Architectural Patterns & Composition

Summary

A practical toolkit for systematically comparing and tuning multi-agent system choices (backend LLMs, agent frameworks, and architectures), designed to cope with the stochastic and failure-prone nature of real deployments.

Description

Multi-agent systems (MASes) powered by large language models (LLMs) are increasingly deployed in real applications, yet practitioners still lack a practical way to systematically compare and tune key system choices such as backend LLMs, agent frameworks, and MAS architectures. The stochastic and failure-prone nature of runtime LLM decisions further complicates controlled experimentation and ablation. Existing benchmarks largely emphasize application-level outcomes (e.g., task success) and provide limited support for studying how these system knobs shape end-to-end behavior. We introduce MAESTRO, an open-source benchmark platform that packages 12 representative MASes and makes it easy to configure, run, and compare variants under controlled repeated executions. To enable diagnosis and the discovery of systematic findings, MAESTRO captures execution dynamics and key system signals, including call-graph similarity and resource consumption. Using MAESTRO, we study these 12 MASes spanning diverse domains, agent frameworks, architectures, and backend LLMs under controlled repeated runs. Our evaluation shows that even when high-level interaction structures remain stable, execution order can vary substantially, and tool usage exhibits strong architecture-dependent cost–accuracy trade-offs. Together, MAESTRO and these findings provide practical guidance for designing, optimizing, and deploying robust MASes.
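
To make the distinction between stable interaction structure and varying execution order concrete, here is a minimal sketch of how the two could be compared across repeated runs. It is not MAESTRO's API or its actual call-graph similarity metric, which this description does not specify. The trace format (a list of caller/callee pairs), the agent names, and the helper functions are all hypothetical. Jaccard overlap of edge sets and a sequence-matching ratio are stand-in measures chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical trace format: one (caller, callee) pair per agent or tool call.
Call = Tuple[str, str]
Trace = List[Call]


def structural_similarity(a: Trace, b: Trace) -> float:
    """Jaccard overlap of call-graph edges, ignoring order and repetition."""
    edges_a, edges_b = set(a), set(b)
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


def order_similarity(a: Trace, b: Trace) -> float:
    """Sequence-matching ratio over the ordered calls (1.0 means identical order)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def mean_pairwise(runs: List[Trace], metric: Callable[[Trace, Trace], float]) -> float:
    """Average a pairwise metric over all pairs of repeated runs."""
    return mean(metric(a, b) for a, b in combinations(runs, 2))


# Three illustrative runs of the same (made-up) MAS: same edges, different order.
runs = [
    [("planner", "searcher"), ("planner", "coder"), ("coder", "reviewer")],
    [("planner", "coder"), ("coder", "reviewer"), ("planner", "searcher")],
    [("planner", "searcher"), ("planner", "coder"), ("coder", "reviewer")],
]

structure_score = mean_pairwise(runs, structural_similarity)
order_score = mean_pairwise(runs, order_similarity)
print(f"structure: {structure_score:.2f}, order: {order_score:.2f}")
```

In this toy data every run touches the same set of edges, so the structural score is at its maximum, while the ordering score falls below it because one run invokes the agents in a different sequence. Reporting both numbers side by side is one simple way to surface the kind of pattern described above.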

ACM CAIS 2026 Sponsors