Accepted Demos

ACM CAIS 2026 — Demo Track

46 system demonstrations accepted to CAIS 2026. These are working implementations of AI systems and agent systems that will be presented live in San Jose, May 27–29, 2026. Demos are grouped by primary pillar.

Architectural Patterns & Composition (22 demos)

Nexa: Automatically Surfacing Business Impacting Insights in E-commerce Applications

Smart Sun (Conviva), Sayan Sinha (Georgia Tech/Conviva), Haijie Wu (Conviva), Aditya Ganjam (Conviva), Qichu Gong (Conviva), Wei Wang (Conviva), Zhan Yang (Conviva), Bo Lin (Conviva), Vipul Harsh (Conviva), Ningning Hu (Conviva), B. Aditya Prakash (Georgia Tech), Vyas Sekar (CMU), Hui Zhang (Conviva)

An agentic framework that automatically discovers business-impacting behavioral patterns across billions of e-commerce user interactions using contrastive stateful trajectories.

Architectural Patterns & Composition

EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools

Boer Zhang (Meta), Mingyan Wu (Northeastern University, China), Dongzhuoran Zhou (University of Oslo & Bosch Center of AI), Yuqicheng Zhu (University of Stuttgart & Bosch Center of AI), Wendong Fan (CAMEL-AI.org & Eigent.ai), Puzhen Zhang (CAMEL-AI.org & Eigent.ai), Zifeng Ding (University of Cambridge & Mina AI), Guohao Li (CAMEL-AI.org & Eigent.ai), Yuan He (Amazon)

A set of structured query and evidence processing tools that make web search more deliberate for deep research agents, improving accuracy across four benchmarks.

Architectural Patterns & Composition

From Bug Report to Pull Request: An Autonomous Agent Pipeline for Production Issue Resolution

Roberto Milev (Navan), Uday Kanagala (Navan), Chris Cholette (Navan)

An end-to-end autonomous agent (Sherlock) that traces production errors from Jira through New Relic to GitHub and opens a fix PR, resolving 41% of tickets in ~9 minutes vs. a 4.2-hour manual baseline.

Architectural Patterns & Composition | Engineering & Operations

HearthNet: Edge Multi-Agent Orchestration for Smart Homes

Zhonghao Zhan (Imperial College London), Krinos Li (Imperial College London), Yefan Zhang (Independent Researcher), Hamed Haddadi (Imperial College London)

An edge multi-agent system that coordinates persistent, role-specialized LLM agents via MQTT and Git-backed state for natural-language smart home control on commodity hardware.

Architectural Patterns & Composition

Parallel Environments for Agents

Shangyin Tan (University of California, Berkeley), Jialin Zhang (University of California, Berkeley), Matei Zaharia (UC Berkeley)

A framework that lets agents branch execution across parallel isolated environment instances, achieving 48% on SWE-bench Pro with a 15-point gain over single-path baselines.

Architectural Patterns & Composition | System Optimization & Efficiency

TRACE: A Multi-Agent System for Natural Language-Driven Social Graph Investigation

Arunachaleshwar Ravichandran (Meta), Nicole Chen (Meta), Ankitesh Gupta (Meta), Antonios Broumas (Meta), Ioannis Konstantakopoulos (Meta), Seyoung Park (Meta)

A multi-agent system for social graph forensics that uses natural-language behavior detection and LLM-driven graph exploration, achieving 10x network expansion and 91.9% discovery of unknown suspicious entities.

Architectural Patterns & Composition

Agent-Aided Design for Dynamic CAD Models

Mitch Adler (Unaffiliated), Matthew Russo (MIT), Michael Cafarella (MIT)

An agentic system for generating dynamic 3D CAD assemblies with moving parts, using external constraint solvers and visual feedback to overcome LLM spatial reasoning limitations.

Architectural Patterns & Composition

Steering Agent Behavior via a Domain Expert-Driven Alignment-to-Optimization Bridge

Wesley Pasfield (University of San Diego and Databricks)

A system that makes agent behavior steerable by bridging domain expert trace labels to calibrated evaluation judges and optimized prompts, improving performance by 15.7%.

Architectural Patterns & Composition | Evaluation & Benchmarking

Scaling Expert Feedback with Reflective Edit Propagation in Compositional Knowledge Bases

Jiajing Guo (Bosch Research North America), Xueming Li (Robert Bosch GmbH), Jorge H Piazentin Ono (Bosch Research North America), Wenbin He (Bosch Research North America), Liu Ren (Bosch Research North America)

A reflective agent (RAID) that infers the semantic intent behind a single expert edit to a knowledge base and propagates corrections across the entire KB automatically.

Architectural Patterns & Composition

L.A.K.E.: Logic Agent for Knowledge Extraction in Data Planning

Jean-Flavien Bussotti (Megagon Labs), Naoki Otani (Megagon Labs), Eser Kandogan (Megagon Labs)

An agentic data planning framework that maps natural language questions to executable workflows over diverse data lake sources, with interactive DAG-based provenance visualization.

Architectural Patterns & Composition | Engineering & Operations

cotomi Act: Learning to Automate Work by Watching You

Masafumi Oyamada (NEC Corporation), Kunihiro Takeoka (NEC Corporation), Kosuke Akimoto (NEC Corporation), Ryoma Obara (NEC Corporation), Masafumi Enomoto (NEC Corporation), Haochen Zhang (NEC Corporation), Daichi Haraguchi (NEC Corporation), Takuya Tamura (NEC Corporation)

A browser agent that learns organizational work patterns by passively observing user browsing, achieving 80.4% on WebArena while building a shared knowledge workspace.

Architectural Patterns & Composition

Multi-Agent Position Classification with Tool Orchestration: Use Case System for Occupational Taxonomy Mapping

Vahid Farajijobehdar (Kariyer.net R&D Center), İlknur Köseoğlu Sarı (Kariyer.net R&D Center), Nazım Kemal Üre (Stanford University), Engin Zeydan (Centre Tecnològic de Telecomunicacions de Catalunya)

A confidence-gated multi-agent architecture using MCP that normalizes free-form job titles across five languages into occupational taxonomies, reducing classification time by 72.5%.

Architectural Patterns & Composition

Agent 4: Teamwork and Collaboration for Vibe-Coding

Peter Zhong (Replit & Carnegie Mellon University), Jacky Zhao (Replit), Edouard Sioufi (Replit), James Austin (Replit), Bri Pool (Replit), Luis Héctor Chávez (Replit), Adi Dahiya (Replit), Will Ernst (Replit), Dawei Feng (Replit), Devin Halladay (Replit), Toby Ho (Replit), Zade Kaylani (Replit), Imen Kedir (Replit), Vaibhav Kumar (Replit), Zhen Li (Replit), Haya Ode (Replit), Nicholas Ondo (Replit), Darsh Patel (Replit), Alec Wang (Replit), Jordan Walke (Replit), Ibrahim Sheikh (Replit), Poorva Potnis (Replit), Michele Catasta (Replit)

A multi-agent coding architecture for vibe-coding that decomposes tasks into a DAG, executes them on isolated forked environments, and merges via incremental rebasing.

Architectural Patterns & Composition | Engineering & Operations

Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation

Debanshu Das (Google), Lavi Nigam (Google), Sunil Kumar Jang Bahadur (Google), Gopala Dhar (Google)

A compound AI system that enforces brand consistency in generative video production through retrieval-based brand DNA extraction and an adversarial multi-agent QC loop, improving brand compliance from 42% to 89%.

Architectural Patterns & Composition

Demonstration of Pneuma-Seeker: Agentic System for Reifying and Fulfilling Information Needs on Tabular Data

Muhammad Imam Luthfi Balaka (The University of Chicago), Raul Castro Fernandez (The University of Chicago)

An agentic system that reifies vague user information needs as inspectable relational specifications for iterative tabular data discovery and provenance-aware execution.

Architectural Patterns & Composition

GRAFT: gRPC-Routed Agent Framework for Tasking in Edge and Personal Devices

Chinmay Shringi (New York University), Alon Hillel-Tuch (New York University), Sariya Rizwan (Pace University)

A distributed edge orchestration system that routes structured tasks across heterogeneous personal devices using gRPC, enabling multi-device SLM workloads entirely off-grid.

Architectural Patterns & Composition | System Optimization & Efficiency

SQLsaber: Agentic SQL Assistant for Efficient and High-Accuracy Natural Language Database Exploration

Sarthak Jariwala (Swift Solar Inc.)

An agentic SQL assistant equipped with four core tools that achieves 95.2% execution accuracy on BIRD with Claude Opus, a 47.8% improvement over the prior best, in a median of 16 seconds.

Architectural Patterns & Composition | Evaluation & Benchmarking | Security & Privacy

DRCY: Agentic Hardware Design Reviews

Kyle Dumont (AllSpice Inc.), Nick Herbert (AllSpice Inc.), Hayder Tirmazi (AllSpice Inc.), Shrikanth Upadhayaya (AllSpice Inc.)

The first production-ready multi-agent LLM system for automated hardware schematic review, performing pin-by-pin analysis against datasheets and posting findings on design reviews.

Architectural Patterns & Composition | Engineering & Operations

SkyDiscover: A Flexible, Adaptive Framework for AI-Driven Scientific and Algorithmic Discovery

Shu Liu (UC Berkeley), Mert Cemri (UC Berkeley), Shubham Agarwal (UC Berkeley), Alexander Krentsel (UC Berkeley), Ashwin Naren (UC Berkeley), Qiuyang Mang (UC Berkeley), Zhifei Li (UC Berkeley), Akshat Gupta (UC Berkeley), Monishwaran Maheswaran (University of California, Berkeley), Audrey Cheng (UC Berkeley), Melissa Pan (UC Berkeley), Ethan Boneh (Stanford University), Kannan Ramchandran (University of California at Berkeley), Koushik Sen (UC Berkeley), Matei Zaharia (UC Berkeley), Alexandros G. Dimakis (UC Berkeley), Ion Stoica (UC Berkeley)

A modular framework for AI-driven algorithmic discovery via evolutionary search, achieving the strongest open-source performance across 200+ optimization tasks and matching AlphaEvolve on many of them.

Architectural Patterns & Composition

ClinicBot: A Guideline-Grounded Clinical Chatbot with Prioritized Evidence RAG and Verifiable Citations

Navapat Nananukul (University of Southern California), Mayank Kejriwal (University of Southern California)

A clinical chatbot that grounds answers in official medical guidelines using prioritized evidence retrieval and verifiable citations, addressing LLM hallucination risks in high-stakes diagnostic contexts.

Architectural Patterns & Composition

Complex Knowledge Curation using Agentic Ontological Notebook Memory

Gully Burns (Unaffiliated), Paul Groth (University of Amsterdam)

A personal AI research assistant that uses explicit ontological commitments stored in a TypeDB knowledge graph as agent memory, demonstrated for job-hunting and disease-mechanism understanding.

Architectural Patterns & Composition

A Compound AI Agent for Conversational Grant Discovery

Zhisheng Tang (University of Southern California), Mayank Kejriwal (University of Southern California)

A compound AI agent that unifies fragmented federal grant discovery across ~12,000 opportunities with conversational multi-turn search, used by 3,000+ researchers.

Architectural Patterns & Composition

Evaluation & Benchmarking (6 demos)

Peeking Under the Hood of Multi-Agent Systems

Tie Ma (Beihang University), Yixi Chen (KAUST), Vaastav Anand (MPI-SWS), Alessandro Cornacchia (KAUST), Amândio R. Faustino (KAUST), Guanheng Liu (Beihang University), Shan Zhang (Beihang University), Hongbin Luo (Beihang University), Suhaib A. Fahmy (KAUST), Zafar A. Qazi (LUMS and KAUST), Marco Canini (KAUST)

A practical toolkit for systematically comparing and tuning multi-agent system choices (backend LLMs, agent frameworks, and architectures), addressing the stochastic and failure-prone nature of real deployments.

Evaluation & Benchmarking | Architectural Patterns & Composition

Skilled AI Agents for Embedded and IoT Systems Development

Yiming Li (Duke University), Yuhan Cheng (Duke University), Mingchen Ma (Duke University), Yihang Zou (Duke University), Ningyuan Yang (Duke University), Wei Cheng (Duke University), Hai "Helen" Li (Duke University), Yiran Chen (Duke University), Tingjun Chen (Duke University)

A skills-based agentic framework for hardware-in-the-loop embedded/IoT development with a benchmark spanning 3 platforms, 23 peripherals, and 42 tasks validated on real hardware.

Evaluation & Benchmarking | Architectural Patterns & Composition

SREGym: A Live Training Ground for AI SRE Agents with High-Fidelity Failure Drills

Jackson Clark (University of Illinois Urbana-Champaign), Yiming Su (University of Illinois Urbana-Champaign), Saad Mohammad Rafid Pial (Bangladesh University of Engineering and Technology), Lily Gniedziejko (University of Illinois Urbana-Champaign), Tianyin Xu (University of Illinois Urbana-Champaign)

A live benchmark for AI SRE agents featuring high-fidelity failure drills with fault injection across OS kernels, hardware, and compound multi-event scenarios.

Evaluation & Benchmarking

Introspectable, Updatable, and Uncertainty-aware Classification of Language Model Instruction-following

Allen Schmaltz (Reexpress AI, Inc.)

An open-source MCP server for uncertainty-aware binary classification of LLM instruction-following, using similarity-distance-magnitude estimators with interpretability-by-exemplar.

Evaluation & Benchmarking | Security & Privacy

Valkyrie: A Microservice-Based Framework for Scalable Evaluation of AI Agents

Jarett Forzano (Vals AI), Omar Almatov (Vals AI), Langston Nashold (Vals AI), Nikil Ravi (Vals AI), Orestes Kassian (Vals AI)

A microservice-based benchmarking framework that decouples benchmark code, agent logic, and execution infrastructure for scalable, reproducible evaluation of AI agents.

Evaluation & Benchmarking | Engineering & Operations

Arena: Benchmarking AI Agent Frameworks Under Fixed-Model Conditions

Roberto Milev (Navan), Uday Kanagala (Navan)

An open-source benchmarking tool that evaluates agent frameworks under fixed-model conditions, finding that scenario-specific orchestration adds no measurable benefit over generic agentic loops.

Evaluation & Benchmarking

Security & Privacy (3 demos)

Governance by Construction for Generalist Agents

Segev Shlomov (IBM), Iftach Shoham (IBM), Alon Oved (IBM), Ido Levy (IBM), Sami Marreed (IBM), Harold Ship (IBM), Offer Akrabi (IBM), Sergey Zeltyn (IBM), Avi Yaeli (IBM), Nir Mashkif (IBM)

A modular policy-as-code governance layer that enforces compliance at five structural checkpoints across an LLM agent's execution pipeline without model fine-tuning.

Security & Privacy

Forge: Closing the Agentic Reliability Gap Between Self-Hosted and Frontier Language Models

Antoine Zambelli (Texas Instruments, Inc)

An open-source guardrail framework that enables an 8B self-hosted model to achieve 99% agentic workflow accuracy, matching frontier APIs.

Security & Privacy | Evaluation & Benchmarking

Hedwig: Dynamic Autonomy for Coding Agents Under Local Oversight

Tanjal Shukla (University of Washington), Kevin Feng (University of Washington), Leijie Wang (University of Washington), Mohammad Rostami (Amazon GenAI Innovation Center), Amy Zhang (University of Washington)

A CLI coding agent that dynamically adapts its autonomy level based on developer-agent interaction history, tightening oversight in unfamiliar territory and loosening it where trust is earned.

Security & Privacy | Architectural Patterns & Composition

System Optimization & Efficiency (5 demos)

optimize_anything: A Universal API for Optimizing any Text Parameter

Lakshya A Agrawal (University of California, Berkeley), Donghyun Lee (University of California, Berkeley), Wenjie Ma (University of California, Berkeley), Karim Elmaaroufi (University of California, Berkeley), Shangyin Tan (University of California, Berkeley), Sanjit A. Seshia (University of California, Berkeley), Koushik Sen (University of California, Berkeley), Dan Klein (University of California, Berkeley), Ion Stoica (University of California, Berkeley), Joseph Gonzalez (University of California, Berkeley), Omar Khattab (Massachusetts Institute of Technology), Alexandros G. Dimakis (University of California, Berkeley), Matei Zaharia (University of California, Berkeley)

A declarative API that treats code, prompts, and agent architectures as optimizable text artifacts, with reported results including 47% faster Claude Code resolution and 89.5% ARC-AGI accuracy.

System Optimization & Efficiency | Architectural Patterns & Composition

StigmergyRouter: A Fault-Aware Adaptive Routing Demo for Multi-Agent AI Systems

Jing Du (Northeastern University), Hang Zhao (Northeastern University), Kenneth Huang (University of Pennsylvania)

A pheromone-memory-based routing layer that adapts multi-agent routing from low-cost heartbeat feedback alone, improving failed-agent avoidance to 95.7% under specialist faults.

System Optimization & Efficiency | Architectural Patterns & Composition

Automatically Learning Skills for Coding Agents

Shangyin Tan (University of California, Berkeley), Lakshya A Agrawal (UC Berkeley), Rohit Sandadi (University of California, Berkeley), Dan Klein (University of California, Berkeley), Koushik Sen (UC Berkeley), Alexandros G. Dimakis (UC Berkeley), Matei Zaharia (UC Berkeley)

A fully automated pipeline that learns repository-specific skills from synthetic tasks and evolutionary optimization, boosting coding agent performance without fine-tuning.

System Optimization & Efficiency | Architectural Patterns & Composition

Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models

Jae-Won Chung (University of Michigan), Jeff J. Ma (University of Michigan), Jisang Ahn (University of Michigan), Yizhuo Liang (USC), Akshay Jajoo (Cisco Research), Myungjin Lee (Cisco Research), Mosharaf Chowdhury (University of Michigan)

A distributed serving system for any-to-any multimodal models that enables component disaggregation and independent scaling, delivering up to 3.81× higher throughput.

System Optimization & Efficiency | Architectural Patterns & Composition

Orla: A Library for Serving LLM-Based Multi-Agent Systems

Rana Shahout (Harvard University), Hayder Tirmazi (Boston University), Minlan Yu (Harvard University), Michael Mitzenmacher (Harvard University)

A serving library for LLM multi-agent systems that separates request execution from workflow policy, with stage mapping, workflow orchestration, and cross-workflow KV cache management.

System Optimization & Efficiency | Architectural Patterns & Composition

Engineering & Operations (10 demos)

Behavioral Fingerprints for LLM Endpoint Stability and Identity

Jonah Leshin (VAIL), Manish Shah (VAIL), Ian Timmis (VAIL), Daniel Kang (University of Illinois at Urbana-Champaign)

A black-box monitoring system that detects behavioral changes in LLM endpoints caused by weight updates, quantization, or infrastructure changes via output distribution fingerprinting.

Engineering & Operations | Evaluation & Benchmarking

Sentinel: Autonomous Architectural Governance Through Commit Intelligence Across Multi-Repository Systems

Roberto Milev (Navan), Uday Kanagala (Navan)

A system of choreographed asynchronous agents that provides continuous architectural governance by processing commits and validating implementation against documented design decisions across 120+ microservices.

Engineering & Operations | Architectural Patterns & Composition

Pathfinder: Self-Improving Agent Trace Analysis via Adversarial Self-Play and Code Execution

Dhruv Atreja (Unaffiliated)

A schema-agnostic trace analysis system that uses adversarial self-play and executable code search to debug agent failures, outperforming RAG by 35 points.

Engineering & Operations | Evaluation & Benchmarking

Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents

Zidane Wright (IBM Research), Jason Tsay (IBM Research), Anupama Murthi (IBM Research), Osher Elhadad (IBM Research), Diego Del Rio (IBM), Saurabh Goyal (IBM Research), Kiran Kate (IBM Research), Jim A. Laredo (IBM Research), Koren Lazar (IBM Research), Vinod Muthusamy (IBM T.J. Watson Research Center), Yara Rizk (IBM Research)

A modular open-source middleware toolkit that systematically addresses agent failure modes across the full lifecycle with reusable components for validation, error recovery, and compliance.

Engineering & Operations | Security & Privacy

Context Viewer: Turning LLM Contexts into Analyzable Artifacts

Srihari Sriraman (nilenso), Michael Isaac (Carnegie Mellon University), Atharva Raykar (nilenso), Heather Miller (Carnegie Mellon University)

A browser-based visual analytics system for exploratory analysis of LLM contexts, enabling users to inspect token usage, compare trajectories, and debug agent failures.

Engineering & Operations

PAGER: Proactive Monitoring Agent for Enterprise AI Assistant

Junior Garcia (New York University), Sujan Dutta (Rochester Institute of Technology), Pranav Umakant Pujar (Adobe), Sai Sree Harsha (Adobe), Dan Luo (Adobe), Nikhil Vasudeva (Adobe), Bikas Saha (Adobe), Pritom Baruah (Adobe), Yunyao Li (Adobe)

A proactive monitoring agent that statistically models historical system errors to surface potential failures before they impact enterprise AI assistant users.

Engineering & Operations

Wily: High-Performance Complexity Gated-Feedback for AI Coding Agents

Anthony Shaw (Macquarie University), Amin Beheshti (Macquarie University)

A high-performance code complexity analyzer integrated as gated feedback for coding agents, reducing complexity growth by 10–27% while maintaining comparable resolution rates.

Engineering & Operations | Evaluation & Benchmarking

Operama: Goal-Oriented Reliability and Self-Improvement for Multi-Agent Systems

Vishwanath Katharki (Operama), Sainyam Galhotra (Cornell University)

A runtime reliability framework for multi-agent systems that decomposes goals into verifiable sub-goals, monitors execution, and automatically proposes policy updates without retraining.

Engineering & Operations | Evaluation & Benchmarking

AgentClick: A Skill-Based Human-in-the-Loop Review Layer for Terminal AI Agents

Haomin Zhuang (University of Notre Dame), Hanwen Xing (University of Southern California), Xiangliang Zhang (University of Notre Dame)

A browser-based human-in-the-loop review layer for terminal AI agents that enables structured supervision of emails, plans, code, and agent trajectories via an interactive web UI.

Engineering & Operations | Security & Privacy

A Closed-Loop Platform for Prompt-to-Production Development and Autonomous Self-Repair of Apache Flink Streaming Jobs

Purshotam Shah (Yahoo Inc.), Shubhankar Unhale (Yahoo Inc.), Isaiah Zwick-Schachter (Yahoo Inc.), Aaron Gresch (Yahoo Inc.), Chris Williamson (Yahoo Inc.)

An integrated AI platform that autonomously develops, deploys, monitors, and repairs Apache Flink streaming jobs using RAG-powered code generation and a real-time debugging sidecar on Kubernetes.

Engineering & Operations

What Each Pillar Covers

Architectural Patterns & Composition

How multiple models, tools, and retrievers are composed into coherent systems. Research in this space advances inference-time scaling, studies the generation/verification asymmetry behind verifier-based architectures, explores retrieval-augmented, multi-agent, and tool-augmented designs, and asks what principled modular composability looks like in practice, among other directions.

Evaluation & Benchmarking

Characterizing the behavior of compound systems in realistic conditions, including the failure modes that are hardest to detect. Topics include new benchmarks, end-to-end metrics, evaluation methodology that holds up as underlying models grow more capable, and more.

Security & Privacy

Understanding and mitigating threats when agents execute tools on real systems. Topics include threat models for prompt injection and tool misuse, defenses against adaptive attackers, alignment for AI systems with real-world consequences, and more.

System Optimization & Efficiency

Making compound systems faster and cheaper while preserving their capabilities. Topics include end-to-end optimization of non-differentiable pipelines, principled caching, routing, serving at agent scale, and more.

Engineering & Operations

Making compound AI systems reliable in production. Topics include observability for long agent traces, deployment pipelines for compound systems, developer tools the field still needs to invent, and more.