
All Accepted Papers

Tressoir: Unifying Online, Offline, and HIL Design and Evolution of Multi-Agent Systems via Interpretable Blueprints

Amadou Ngom (Massachusetts Institute of Technology), Ziniu Wu (Massachusetts Institute of Technology), Jason Mohoney (Massachusetts Institute of Technology), James Moore (Massachusetts Institute of Technology), Alex Zhang (Massachusetts Institute of Technology), Samuel Madden (Massachusetts Institute of Technology), Tim Kraska (Massachusetts Institute of Technology)

Architectural Patterns & Composition

Abstract

We explore a principled approach that jointly designs and evolves the architectures, prompts, tools, and knowledge of multi-agent systems, whether online, offline, or with human guidance. We first propose Interpretable Blueprints (IBs), which jointly encode online-interpretable Design Documents (describing multi-agent architecture patterns, tool use/creation guidelines, lessons learned, etc.) with offline-generated materialized components and scaffolds proven to be high-quality or cost-effective. Second, we propose a supervising interpreter that co-interprets the IB and the task to construct a specialized agentic system on the fly, without assuming any pre-existing implementation, thereby enabling maximal adaptation to the task. IBs are also the primary online communication mechanism between agents. Offline learning is a subset of this approach: learning IBs encode learning strategies that let the interpreter orchestrate metrics collection and IB improvement. Human guidance is enabled at every layer, whether through co-editing IBs or by steering online or offline interpretation in ways that the system learns from over time. To instantiate this vision, we develop Tressoir, an IB-centric framework that unifies online, offline, and human-guided evolution under a single mechanism. Tressoir is tailored for long-running, complex projects with tasks that build on each other and require continual learning during or between executions. Its generality further allows it to bootstrap itself, so that its own features are self-generated with human guidance. We also evaluate Tressoir on shorter-term benchmarks. On SWE-Bench-Pro's Qutebrowser subset, Tressoir with Claude 4.6 Opus reaches 75.95% vs. 56.96% for SWE-Agent; on ScreenSpot-Pro, it lifts Gemini 3 Flash from a 69.1% baseline to 83.05%; and on Bird-Critic Flash, Tressoir with Gemini 3 Flash tools scores 56%, exceeding SQL-ACT with Claude 4.6 Opus at 52%.
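The core idea above, a blueprint pairing a human-readable design document with offline-vetted components that a supervising interpreter consults at run time, can be illustrated with a minimal sketch. This is not Tressoir's actual implementation; all names (`InterpretableBlueprint`, `interpret`, the example scaffold) are hypothetical, and the real interpreter is an LLM-driven process rather than a dictionary lookup.

```python
from dataclasses import dataclass, field

@dataclass
class InterpretableBlueprint:
    """Hypothetical IB: a human-readable design document paired with
    offline-materialized components and accumulated lessons."""
    design_doc: str                                    # architecture patterns, tool guidelines
    materialized: dict = field(default_factory=dict)   # task kind -> vetted component/scaffold
    lessons: list = field(default_factory=list)        # learned online, offline, or from humans

def interpret(ib: InterpretableBlueprint, task_kind: str) -> str:
    """Toy 'supervising interpreter': reuse a materialized component when one
    exists for this kind of task; otherwise construct a system on the fly
    from the design document alone."""
    if task_kind in ib.materialized:
        return f"reuse:{ib.materialized[task_kind]}"
    return f"construct-from-doc:{task_kind}"

ib = InterpretableBlueprint(
    design_doc="Prefer a planner->executor pattern; record tool failures as lessons.",
    materialized={"sql-repair": "vetted_sql_repair_scaffold"},  # proven offline
)
print(interpret(ib, "sql-repair"))      # hits the offline-materialized scaffold
print(interpret(ib, "gui-grounding"))   # no scaffold yet: built from the design doc
```

The sketch captures only the reuse-or-construct decision; in the paper's framing, the same IB object also serves as the communication medium between agents and the substrate that learning IBs improve.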

ACM CAIS 2026 Sponsors