
Peeking Under the Hood of Multi-Agent Systems

Tie Ma (Beihang University), Yixi Chen (KAUST), Vaastav Anand (MPI-SWS), Alessandro Cornacchia (KAUST), Amândio R. Faustino (KAUST), Guanheng Liu (Beihang University), Shan Zhang (Beihang University), Hongbin Luo (Beihang University), Suhaib A. Fahmy (KAUST), Zafar A. Qazi (LUMS and KAUST), Marco Canini (KAUST)

Evaluation & Benchmarking · Architectural Patterns & Composition

A practical toolkit for systematically comparing and tuning multi-agent system choices (backend LLMs, agent frameworks, and architectures), addressing the stochastic and failure-prone nature of real deployments.

Presentation

Demo session

Thursday, May 28 · 4:30 PM – 6:00 PM

San Jose


Description

Multi-agent systems (MASes) powered by large language models (LLMs) are increasingly deployed in real applications, yet practitioners still lack a practical way to systematically compare and tune key system choices such as backend LLMs, agent frameworks, and MAS architectures. The stochastic and failure-prone nature of runtime LLM decisions further complicates controlled experimentation and ablation. Existing benchmarks largely emphasize application-level outcomes (e.g., task success) and provide limited support for studying how these system knobs shape end-to-end behavior. We introduce MAESTRO, an open-source benchmark platform that packages 12 representative MASes and makes it easy to configure, run, and compare variants under controlled repeated executions. To enable diagnosis and the discovery of systematic findings, MAESTRO captures execution dynamics and key system signals, including call-graph similarity and resource consumption. Using MAESTRO, we study these 12 MASes spanning diverse domains, agent frameworks, architectures, and backend LLMs under controlled repeated runs. Our evaluation shows that even when high-level interaction structures remain stable, execution order can vary substantially, and tool usage exhibits strong architecture-dependent cost–accuracy trade-offs. Together, MAESTRO and these findings provide practical guidance for designing, optimizing, and deploying robust MASes.
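
To make the finding about stable structure but varying execution order concrete, here is a minimal sketch of how two repeated runs could be compared. It is an illustration only and not MAESTRO's actual metric or API: the trace format (a list of caller/callee pairs), the agent names, and the helper functions `edge_set`, `jaccard`, and `order_similarity` are all assumptions made for this example. Structural similarity is approximated with Jaccard overlap of call-graph edges, and order similarity with the standard-library `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def edge_set(trace):
    """Collapse a trace of (caller, callee) records into a set of call-graph edges."""
    return set(trace)


def jaccard(a, b):
    """Jaccard overlap of two edge sets; two empty graphs count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def order_similarity(trace_a, trace_b):
    """Order-sensitive similarity of two call sequences, in [0, 1]."""
    return SequenceMatcher(None, trace_a, trace_b, autojunk=False).ratio()


# Hypothetical traces from two runs of the same MAS (agent names are invented).
run_1 = [("planner", "search"), ("planner", "coder"), ("coder", "tester")]
run_2 = [("planner", "coder"), ("coder", "tester"), ("planner", "search")]

structure = jaccard(edge_set(run_1), edge_set(run_2))
order = order_similarity(run_1, run_2)

print(f"structural similarity: {structure:.2f}")
print(f"order similarity:      {order:.2f}")
```

In this toy case both runs contain the same edges, so the structural score is 1.0, while the order score is lower because the calls happen in a different sequence. Keeping the two measures separate is what lets a comparison distinguish "same interaction structure" from "same execution".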

Artifacts & Links

Paper (ACM Digital Library)
