Accepted Demos
ACM CAIS 2026 — Demo Track
45 system demonstrations accepted to CAIS 2026. These are working implementations of AI systems and agent systems that will be presented live in San Jose, May 27–29, 2026. Demos are grouped by primary pillar.
Architectural Patterns & Composition 22 demos
Nexa: Automatically Surfacing Business Impacting Insights in E-commerce Applications
Smart Sun (Conviva), Sayan Sinha (Georgia Tech/Conviva), Haijie Wu (Conviva), Aditya Ganjam (Conviva), Qichu Gong (Conviva), Wei Wang (Conviva), Zhan Yang (Conviva), Bo Lin (Conviva), Vipul Harsh (Conviva), Ningning Hu (Conviva), B. Aditya Prakash (Georgia Tech), Vyas Sekar (CMU), Hui Zhang (Conviva)
An agentic framework that automatically discovers business-impacting behavioral patterns across billions of e-commerce user interactions using contrastive stateful trajectories.
EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
Boer Zhang (Meta), Mingyan Wu (Northeastern University, China), Dongzhuoran Zhou (University of Oslo & Bosch Center of AI), Yuqicheng Zhu (University of Stuttgart & Bosch Center of AI), Wendong Fan (CAMEL-AI.org & Eigent.ai), Puzhen Zhang (CAMEL-AI.org & Eigent.ai), Zifeng Ding (University of Cambridge & Mina AI), Guohao Li (CAMEL-AI.org & Eigent.ai), Yuan He (Amazon)
A set of structured query and evidence processing tools that make web search more deliberate for deep research agents, improving accuracy across four benchmarks.
From Bug Report to Pull Request: An Autonomous Agent Pipeline for Production Issue Resolution
Roberto Milev (Navan), Uday Kanagala (Navan), Chris Cholette (Navan)
An end-to-end autonomous agent (Sherlock) that traces production errors from Jira through New Relic to GitHub and opens a fix PR, resolving 41% of tickets in ~9 minutes vs. a 4.2-hour manual baseline.
HearthNet: Edge Multi-Agent Orchestration for Smart Homes
Zhonghao Zhan (Imperial College London), Krinos Li (Imperial College London), Yefan Zhang (Independent Researcher), Hamed Haddadi (Imperial College London)
An edge multi-agent system that coordinates persistent, role-specialized LLM agents via MQTT and Git-backed state for natural-language smart home control on commodity hardware.
Parallel Environments for Agents
Shangyin Tan (University of California, Berkeley), Jialin Zhang (University of California, Berkeley), Matei Zaharia (UC Berkeley)
A framework that lets agents branch execution across parallel isolated environment instances, achieving 48% on SWE-bench Pro with a 15-point gain over single-path baselines.
TRACE: A Multi-Agent System for Natural Language-Driven Social Graph Investigation
Arunachaleshwar Ravichandran (Meta), Nicole Chen (Meta), Ankitesh Gupta (Meta), Antonios Broumas (Meta), Ioannis Konstantakopoulos (Meta), Seyoung Park (Meta)
A multi-agent system for social graph forensics that uses natural-language behavior detection and LLM-driven graph exploration, achieving 10x network expansion and 91.9% discovery of unknown suspicious entities.
Agent-Aided Design for Dynamic CAD Models
Mitch Adler (Unaffiliated), Matthew Russo (MIT), Michael Cafarella (MIT)
An agentic system for generating dynamic 3D CAD assemblies with moving parts, using external constraint solvers and visual feedback to overcome LLM spatial reasoning limitations.
Steering Agent Behavior via a Domain Expert-Driven Alignment-to-Optimization Bridge
Wesley Pasfield (University of San Diego and Databricks)
A system that makes agent behavior steerable by bridging domain expert trace labels to calibrated evaluation judges and optimized prompts, improving performance by 15.7%.
Scaling Expert Feedback with Reflective Edit Propagation in Compositional Knowledge Bases
Jiajing Guo (Bosch Research North America), Xueming Li (Robert Bosch GmbH), Jorge H Piazentin Ono (Bosch Research North America), Wenbin He (Bosch Research North America), Liu Ren (Bosch Research North America)
A reflective agent (RAID) that infers the semantic intent behind a single expert edit to a knowledge base and propagates corrections across the entire KB automatically.
L.A.K.E.: Logic Agent for Knowledge Extraction in Data Planning
Jean-Flavien Bussotti (Megagon Labs), Naoki Otani (Megagon Labs), Eser Kandogan (Megagon Labs)
An agentic data planning framework that maps natural language questions to executable workflows over diverse data lake sources, with interactive DAG-based provenance visualization.
cotomi Act: Learning to Automate Work by Watching You
Masafumi Oyamada (NEC Corporation), Kunihiro Takeoka (NEC Corporation), Kosuke Akimoto (NEC Corporation), Ryoma Obara (NEC Corporation), Masafumi Enomoto (NEC Corporation), Haochen Zhang (NEC Corporation), Daichi Haraguchi (NEC Corporation), Takuya Tamura (NEC Corporation)
A browser agent that learns organizational work patterns by passively observing user browsing, achieving 80.4% on WebArena while building a shared knowledge workspace.
Multi-Agent Position Classification with Tool Orchestration: Use Case System for Occupational Taxonomy Mapping
Vahid Farajijobehdar (Kariyer.net R&D Center), İlknur Köseoğlu Sarı (Kariyer.net R&D Center), Nazım Kemal Üre (Stanford University), Engin Zeydan (Centre Tecnològic de Telecomunicacions de Catalunya)
A confidence-gated multi-agent architecture using MCP that normalizes free-form job titles across five languages into occupational taxonomies, reducing classification time by 72.5%.
Agent 4: Teamwork and Collaboration for Vibe-Coding
Peter Zhong (Replit & Carnegie Mellon University), Jacky Zhao (Replit), Edouard Sioufi (Replit), James Austin (Replit), Bri Pool (Replit), Luis Héctor Chávez (Replit), Adi Dahiya (Replit), Will Ernst (Replit), Dawei Feng (Replit), Devin Halladay (Replit), Toby Ho (Replit), Zade Kaylani (Replit), Imen Kedir (Replit), Vaibhav Kumar (Replit), Zhen Li (Replit), Haya Ode (Replit), Nicholas Ondo (Replit), Darsh Patel (Replit), Alec Wang (Replit), Jordan Walke (Replit), Ibrahim Sheikh (Replit), Poorva Potnis (Replit), Michele Catasta (Replit)
A multi-agent coding architecture for vibe-coding that decomposes tasks into a DAG, executes them on isolated forked environments, and merges via incremental rebasing.
Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation
Debanshu Das (Google), Lavi Nigam (Google), Sunil Kumar Jang Bahadur (Google), Gopala Dhar (Google)
A compound AI system that enforces brand consistency in generative video production through retrieval-based brand DNA extraction and an adversarial multi-agent QC loop, improving brand compliance from 42% to 89%.
Demonstration of Pneuma-Seeker: Agentic System for Reifying and Fulfilling Information Needs on Tabular Data
Muhammad Imam Luthfi Balaka (The University of Chicago), Raul Castro Fernandez (The University of Chicago)
An agentic system that reifies vague user information needs as inspectable relational specifications for iterative tabular data discovery and provenance-aware execution.
GRAFT: gRPC-Routed Agent Framework for Tasking in Edge and Personal Devices
Chinmay Shringi (New York University), Alon Hillel-Tuch (New York University), Sariya Rizwan (Pace University)
A distributed edge orchestration system that routes structured tasks across heterogeneous personal devices using gRPC, enabling multi-device SLM workloads entirely off-grid.
SQLsaber: Agentic SQL Assistant for Efficient and High-Accuracy Natural Language Database Exploration
Sarthak Jariwala (Swift Solar Inc.)
An agentic SQL assistant equipped with four core tools that achieves 95.2% execution accuracy on BIRD with Claude Opus, a 47.8% improvement over the prior best, in a median of 16 seconds.
DRCY: Agentic Hardware Design Reviews
Kyle Dumont (AllSpice Inc.), Nick Herbert (AllSpice Inc.), Hayder Tirmazi (AllSpice Inc.), Shrikanth Upadhayaya (AllSpice Inc.)
The first production-ready multi-agent LLM system for automated hardware schematic review, performing pin-by-pin analysis against datasheets and posting findings on design reviews.
SkyDiscover: A Flexible, Adaptive Framework for AI-Driven Scientific and Algorithmic Discovery
Shu Liu (UC Berkeley), Mert Cemri (UC Berkeley), Shubham Agarwal (UC Berkeley), Alexander Krentsel (UC Berkeley), Ashwin Naren (UC Berkeley), Qiuyang Mang (UC Berkeley), Zhifei Li (UC Berkeley), Akshat Gupta (UC Berkeley), Monishwaran Maheswaran (University of California, Berkeley), Audrey Cheng (UC Berkeley), Melissa Pan (UC Berkeley), Ethan Boneh (Stanford University), Kannan Ramchandran (University of California at Berkeley), Koushik Sen (UC Berkeley), Matei Zaharia (UC Berkeley), Alexandros G. Dimakis (UC Berkeley), Ion Stoica (UC Berkeley)
A modular framework for AI-driven algorithmic discovery via evolutionary search, achieving the strongest open-source performance across 200+ optimization tasks and matching AlphaEvolve on many of them.
ClinicBot: A Guideline-Grounded Clinical Chatbot with Prioritized Evidence RAG and Verifiable Citations
Navapat Nananukul (University of Southern California), Mayank Kejriwal (University of Southern California)
A clinical chatbot that grounds answers in official medical guidelines using prioritized evidence retrieval and verifiable citations, addressing LLM hallucination risks in high-stakes diagnostic contexts.
Complex Knowledge Curation using Agentic Ontological Notebook Memory
Gully Burns (Unaffiliated), Paul Groth (University of Amsterdam)
A personal AI research assistant that uses explicit ontological commitments stored in a TypeDB knowledge graph as agent memory, demonstrated for job-hunting and disease-mechanism understanding.
A Compound AI Agent for Conversational Grant Discovery
Zhisheng Tang (University of Southern California), Mayank Kejriwal (University of Southern California)
A compound AI agent that unifies fragmented federal grant discovery across ~12,000 opportunities with conversational multi-turn search, used by 3,000+ researchers.
Evaluation & Benchmarking 6 demos
Peeking Under the Hood of Multi-Agent Systems
Tie Ma (Beihang University), Yixi Chen (KAUST), Vaastav Anand (MPI-SWS), Alessandro Cornacchia (KAUST), Amândio R. Faustino (KAUST), Guanheng Liu (Beihang University), Shan Zhang (Beihang University), Hongbin Luo (Beihang University), Suhaib A. Fahmy (KAUST), Zafar A. Qazi (LUMS and KAUST), Marco Canini (KAUST)
A practical toolkit for systematically comparing and tuning multi-agent system choices (backend LLMs, agent frameworks, and architectures), addressing the stochastic and failure-prone nature of real deployments.
Skilled AI Agents for Embedded and IoT Systems Development
Yiming Li (Duke University), Yuhan Cheng (Duke University), Mingchen Ma (Duke University), Yihang Zou (Duke University), Ningyuan Yang (Duke University), Wei Cheng (Duke University), Hai "Helen" Li (Duke University), Yiran Chen (Duke University), Tingjun Chen (Duke University)
A skills-based agentic framework for hardware-in-the-loop embedded/IoT development with a benchmark spanning 3 platforms, 23 peripherals, and 42 tasks validated on real hardware.
SREGym: A Live Training Ground for AI SRE Agents with High-Fidelity Failure Drills
Jackson Clark (University of Illinois Urbana-Champaign), Yiming Su (University of Illinois Urbana-Champaign), Saad Mohammad Rafid Pial (Bangladesh University of Engineering and Technology), Lily Gniedziejko (University of Illinois Urbana-Champaign), Tianyin Xu (University of Illinois Urbana-Champaign)
A live benchmark for AI SRE agents featuring high-fidelity failure drills with fault injection across OS kernels, hardware, and compound multi-event scenarios.
Introspectable, Updatable, and Uncertainty-aware Classification of Language Model Instruction-following
Allen Schmaltz (Reexpress AI, Inc.)
An open-source MCP server for uncertainty-aware binary classification of LLM instruction-following, using similarity-distance-magnitude estimators with interpretability-by-exemplar.
Valkyrie: A Microservice-Based Framework for Scalable Evaluation of AI Agents
Jarett Forzano (Vals AI), Omar Almatov (Vals AI), Langston Nashold (Vals AI), Nikil Ravi (Vals AI), Orestes Kassian (Vals AI)
A microservice-based benchmarking framework that decouples benchmark code, agent logic, and execution infrastructure for scalable, reproducible evaluation of AI agents.
Arena: Benchmarking AI Agent Frameworks Under Fixed-Model Conditions
Roberto Milev (Navan), Uday Kanagala (Navan)
An open-source benchmarking tool that evaluates agent frameworks under fixed-model conditions, finding that scenario-specific orchestration adds no measurable benefit over generic agentic loops.
Security & Privacy 3 demos
Governance by Construction for Generalist Agents
Segev Shlomov (IBM), Iftach Shoham (IBM), Alon Oved (IBM), Ido Levy (IBM), Sami Marreed (IBM), Harold Ship (IBM), Offer Akrabi (IBM), Sergey Zeltyn (IBM), Avi Yaeli (IBM), Nir Mashkif (IBM)
A modular policy-as-code governance layer that enforces compliance at five structural checkpoints across an LLM agent's execution pipeline without model fine-tuning.
Forge: Closing the Agentic Reliability Gap Between Self-Hosted and Frontier Language Models
Antoine Zambelli (Texas Instruments, Inc.)
An open-source guardrail framework that enables an 8B self-hosted model to achieve 99% agentic workflow accuracy, matching frontier APIs.
Hedwig: Dynamic Autonomy for Coding Agents Under Local Oversight
Tanjal Shukla (University of Washington), Kevin Feng (University of Washington), Leijie Wang (University of Washington), Mohammad Rostami (Amazon GenAI Innovation Center), Amy Zhang (University of Washington)
A CLI coding agent that dynamically adapts its autonomy level based on developer-agent interaction history, tightening oversight in unfamiliar territory and loosening it where trust is earned.
System Optimization & Efficiency 5 demos
optimize_anything: A Universal API for Optimizing any Text Parameter
Lakshya A Agrawal (University of California, Berkeley), Donghyun Lee (University of California, Berkeley), Wenjie Ma (University of California, Berkeley), Karim Elmaaroufi (University of California, Berkeley), Shangyin Tan (University of California, Berkeley), Sanjit A. Seshia (University of California, Berkeley), Koushik Sen (University of California, Berkeley), Dan Klein (University of California, Berkeley), Ion Stoica (University of California, Berkeley), Joseph Gonzalez (University of California, Berkeley), Omar Khattab (Massachusetts Institute of Technology), Alexandros G. Dimakis (University of California, Berkeley), Matei Zaharia (University of California, Berkeley)
A declarative API that treats code, prompts, and agent architectures as optimizable text artifacts, achieving results such as 47% faster Claude Code resolution and 89.5% ARC-AGI accuracy.
StigmergyRouter: A Fault-Aware Adaptive Routing Demo for Multi-Agent AI Systems
Jing Du (Northeastern University), Hang Zhao (Northeastern University), Kenneth Huang (University of Pennsylvania)
A pheromone-memory-based routing layer that adapts multi-agent routing from low-cost heartbeat feedback alone, improving failed-agent avoidance to 95.7% under specialist faults.
Automatically Learning Skills for Coding Agents
Shangyin Tan (University of California, Berkeley), Lakshya A Agrawal (UC Berkeley), Rohit Sandadi (University of California, Berkeley), Dan Klein (University of California, Berkeley), Koushik Sen (UC Berkeley), Alexandros G. Dimakis (UC Berkeley), Matei Zaharia (UC Berkeley)
A fully automated pipeline that learns repository-specific skills from synthetic tasks and evolutionary optimization, boosting coding agent performance without fine-tuning.
Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models
Jae-Won Chung (University of Michigan), Jeff J. Ma (University of Michigan), Jisang Ahn (University of Michigan), Yizhuo Liang (USC), Akshay Jajoo (Cisco Research), Myungjin Lee (Cisco Research), Mosharaf Chowdhury (University of Michigan)
A distributed serving system for any-to-any multimodal models that enables component disaggregation and independent scaling, delivering up to 3.81× higher throughput.
Orla: A Library for Serving LLM-Based Multi-Agent Systems
Rana Shahout (Harvard University), Hayder Tirmazi (Boston University), Minlan Yu (Harvard University), Michael Mitzenmacher (Harvard University)
A serving library for LLM multi-agent systems that separates request execution from workflow policy, with stage mapping, workflow orchestration, and cross-workflow KV cache management.
Engineering & Operations 9 demos
Behavioral Fingerprints for LLM Endpoint Stability and Identity
Jonah Leshin (VAIL), Manish Shah (VAIL), Ian Timmis (VAIL), Daniel Kang (University of Illinois at Urbana-Champaign)
A black-box monitoring system that detects behavioral changes in LLM endpoints caused by weight updates, quantization, or infrastructure changes via output distribution fingerprinting.
Sentinel: Autonomous Architectural Governance Through Commit Intelligence Across Multi-Repository Systems
Roberto Milev (Navan), Uday Kanagala (Navan)
A system of choreographed asynchronous agents that provides continuous architectural governance by processing commits and validating implementation against documented design decisions across 120+ microservices.
Pathfinder: Self-Improving Agent Trace Analysis via Adversarial Self-Play and Code Execution
Dhruv Atreja (Unaffiliated)
A schema-agnostic trace analysis system that uses adversarial self-play and executable code search to debug agent failures, outperforming RAG by 35 points.
Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents
Zidane Wright (IBM Research), Jason Tsay (IBM Research), Anupama Murthi (IBM Research), Osher Elhadad (IBM Research), Diego Del Rio (IBM), Saurabh Goyal (IBM Research), Kiran Kate (IBM Research), Jim A. Laredo (IBM Research), Koren Lazar (IBM Research), Vinod Muthusamy (IBM T.J. Watson Research Center), Yara Rizk (IBM Research)
A modular open-source middleware toolkit that systematically addresses agent failure modes across the full lifecycle with reusable components for validation, error recovery, and compliance.
Context Viewer: Turning LLM Contexts into Analyzable Artifacts
Srihari Sriraman (nilenso), Michael Isaac (Carnegie Mellon University), Atharva Raykar (nilenso), Heather Miller (Carnegie Mellon University)
A browser-based visual analytics system for exploratory analysis of LLM contexts, enabling users to inspect token usage, compare trajectories, and debug agent failures.
PAGER: Proactive Monitoring Agent for Enterprise AI Assistant
Junior Garcia (New York University), Sujan Dutta (Rochester Institute of Technology), Pranav Umakant Pujar (Adobe), Sai Sree Harsha (Adobe), Dan Luo (Adobe), Nikhil Vasudeva (Adobe), Bikas Saha (Adobe), Pritom Baruah (Adobe), Yunyao Li (Adobe)
A proactive monitoring agent that statistically models historical system errors to surface potential failures before they impact enterprise AI assistant users.
Wily: High-Performance Complexity Gated-Feedback for AI Coding Agents
Anthony Shaw (Macquarie University), Amin Beheshti (Macquarie University)
A high-performance code complexity analyzer integrated as gated feedback for coding agents, reducing complexity growth by 10–27% while maintaining comparable resolution rates.
Operama: Goal-Oriented Reliability and Self-Improvement for Multi-Agent Systems
Vishwanath Katharki (Operama), Sainyam Galhotra (Cornell University)
A runtime reliability framework for multi-agent systems that decomposes goals into verifiable sub-goals, monitors execution, and automatically proposes policy updates without retraining.
AgentClick: A Skill-Based Human-in-the-Loop Review Layer for Terminal AI Agents
Haomin Zhuang (University of Notre Dame), Hanwen Xing (University of Southern California), Xiangliang Zhang (University of Notre Dame)
A browser-based human-in-the-loop review layer for terminal AI agents that enables structured supervision of emails, plans, code, and agent trajectories via an interactive web UI.