Accepted Papers
ACM CAIS 2026 — Research Track
61 papers accepted to the CAIS 2026 research track, grouped by primary pillar.
Architectural Patterns & Composition (25 papers)
optany: Unified Text Optimization can Outperform Specialized Systems
Lakshya A Agrawal (University of California, Berkeley), Donghyun Lee (University of California, Berkeley), Wenjie Ma (University of California, Berkeley), Karim Elmaaroufi (University of California, Berkeley), Rohit Sandadi (University of California, Berkeley), Shangyin Tan (University of California, Berkeley), Sanjit A. Seshia (University of California, Berkeley), Koushik Sen (University of California, Berkeley), Dan Klein (University of California, Berkeley), Ion Stoica (University of California, Berkeley), Joseph Gonzalez (University of California, Berkeley), Omar Khattab (Massachusetts Institute of Technology), Alexandros G. Dimakis (University of California, Berkeley), Matei Zaharia (University of California, Berkeley)
optany is a single LLM-based optimization system that achieves state-of-the-art results across six diverse tasks simultaneously—nearly tripling Gemini Flash's ARC-AGI accuracy, cutting cloud scheduling costs 40%, and matching AlphaEvolve on circle packing—by framing all problems as improving a text artifact evaluated by a scoring function. The results challenge the assumption that domain-specific optimization tools are necessary.
Context, Reasoning, and Hierarchy: A Cost–Performance Study of Compound LLM Agent Design in an Adversarial POMDP
Igor Bogdanov (Carleton University), Chung-Horng Lung (Carleton University), Thomas Kunz (Carleton University), Jie Gao (Carleton University), Adrian Taylor (Defence R&D Canada), Marzia Zaman (Cistel Technology)
A controlled study of compound LLM agent design in CybORG, an adversarial cyber defense environment, that separates the effects of context design, reasoning depth, and task decomposition on agent performance. It gives practitioners empirical guidance on which design choices genuinely improve outcomes versus which merely increase inference cost via token consumption.
Tressoir: Unifying Online, Offline, and HIL Design and Evolution of Multi-Agent Systems via Interpretable Blueprints
Amadou Ngom (Massachusetts Institute of Technology), Ziniu Wu (Massachusetts Institute of Technology), Jason Mohoney (Massachusetts Institute of Technology), James Moore (Massachusetts Institute of Technology), Alex Zhang (Massachusetts Institute of Technology), Samuel Madden (Massachusetts Institute of Technology), Tim Kraska (Massachusetts Institute of Technology)
Tressoir jointly designs and evolves multi-agent architectures, prompts, tools, and knowledge through human-readable Interpretable Blueprints that encode both online design intent and offline-generated high-quality components. It supports automated, human-guided, and hybrid optimization modes, making multi-agent system development more systematic and reproducible.
Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Pouya Hamadanian (MIT), Pantea Karimi (MIT), Arash Nasr-Esfahany (MIT), Kimia Noorbakhsh (MIT), Joseph Chandler (MIT), Ali Parandeh (Independent), Mohammad Alizadeh (MIT), Hari Balakrishnan (MIT)
Glia is an AI system that autonomously designs and optimizes computer network mechanisms using a human-inspired multi-agent workflow in which specialized agents reason, experiment, and analyze collaboratively. Applied to distributed systems challenges, it produces interpretable designs that rival human expert solutions.
A Language for Describing Agentic LLM Contexts
Noga Peleg Pelc (Bar Ilan University, Israel), Gal A. Kaminka (Bar Ilan University, Israel), Yoav Goldberg (Bar Ilan University, Israel)
A formal language for precisely describing how LLM context is composed and evolves across agent interaction steps, replacing the informal prose and ad-hoc diagrams currently used in context engineering. It enables teams and researchers to communicate context structure unambiguously—across prompt templates, multi-turn history, and system instructions.
Retrieval-Augmented LLMs for Security Incident Analysis
Xavier Cadet (Dartmouth College), Aditya Vikram Singh (Northeastern University), Harsh Mamania (Northeastern University), Edward Koh (Dartmouth College), Alex Fitts (Punch Cyber), Dirk Van Bruggen (Punch Cyber), Simona Boboila (Northeastern University), Peter Chin (Dartmouth College), Alina Oprea (Northeastern University)
A RAG-based system that automates cybersecurity incident analysis by mapping evidence from heterogeneous logs to MITRE ATT&CK techniques and generating structured incident reports. It substantially reduces the manual effort of correlating intrusion alerts, network records, and authentication events into a coherent attack narrative.
Improving Coherence and Persistence in Agentic AI for System Optimization
Pantea Karimi (MIT), Kimia Noorbakhsh (MIT), Mohammad Alizadeh (MIT), Hari Balakrishnan (MIT)
A framework for automated system heuristic design that addresses two fundamental LLM agent failure modes: evolutionary neighborhood bias (getting stuck in local optima) and the coherence ceiling (context degradation over long agent runs). Together the fixes enable more reliable, multi-step discovery of high-performance system configurations.
fastWorkflow: Closing the Agentic Performance Gap Between Small and Frontier Language Models
Sanchit Satija (Radiant Logic), Aditya Bhatt (Radiant Logic), Priyanshu Jani (Radiant Logic), Dhar Rawal (Radiant Logic)
fastWorkflow is a dual-mode agentic framework that closes the performance gap between small and frontier language models by addressing a five-dimensional taxonomy of SLM failure modes: NLU, tool management, planning, agentic reasoning, and context management. It enables smaller, lower-cost, privacy-preserving models to reach near-frontier task success rates on agentic benchmarks.
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
Shuren Xia (Rutgers University), Qiwei Li (Rutgers University), Taqiya Ehsan (Rutgers University), Jorge Ortiz (Rutgers University)
TraceFix is a verification-first pipeline that synthesizes TLA+-verified coordination protocols for multi-agent systems, compiles them into per-agent system prompts, and enforces them at runtime with a monitor that rejects out-of-topology operations. It provides formal correctness guarantees for multi-agent coordination from nothing more than a natural-language task description.
Composing Policy Gradients and Prompt Optimization for Language Model Programs
Noah Ziems (University of Notre Dame), Dilara Soylu (Stanford University), Lakshya A Agrawal (UC Berkeley), Isaac Miller (Normal Computing), Liheng Lai (UC Berkeley), Chen Qian (Databricks), Kaiqiang Song (Zoom, Inc.), Meng Jiang (University of Notre Dame), Dan Klein (UC Berkeley), Matei Zaharia (UC Berkeley, Databricks), Karel D’Oosterlinck (Contextual AI), Christopher Potts (Stanford University), Omar Khattab (MIT)
A generalization of GRPO to modular multi-prompt LLM programs that enables RL post-training across agent systems with multiple LM calls, variable-length trajectories, and interrupted rollouts. The paper shows for the first time that RL training and automatic prompt optimization compose well together, jointly improving accuracy by 11% on average.
LiveGraph: A Compound AI System for Evolving Knowledge Graph Construction from Streaming Data
Rakshit Agrawal (Microsoft), Pritesh Kanani (Microsoft), Madhu Sudan (Microsoft), Ashish Gujarathi (Microsoft), Dhruv Srivastava (Microsoft), Mikita Reut (Microsoft), Naveen Shrivastava (Microsoft)
LiveGraph is a compound AI system for continuously updating knowledge graphs from streaming data using a formal operator algebra—five atomic operators with inverses—that enables full rollback, operator logging, and component-level error attribution. It makes incremental knowledge graph construction from heterogeneous live data sources reliable and debuggable.
Expansion-Contraction: A Multi-Agent Graph Traversal Pattern for Compound AI Systems
Aiham Taleb (AWS), Zainab Afolabi (AWS), Joao Sousa (AWS), Mathias Seidel (Continental Tires)
Expansion-Contraction is a domain-agnostic multi-agent coordination pattern in which an expansion phase dynamically spawns specialist agents mapped to nodes of a domain graph, and a contraction phase aggregates their findings inward toward a verdict. Agent topology emerges from data structure rather than hand-design, and each agent's context stays small regardless of graph size.
Vista: Verifier-in-the-Loop Agentic RL for Semantic Program Synthesis in Quantum Computing
Cong Yu (Aalto University), Tuo Shi (Aalto University), Valter Uotila (Aalto University), Shilong Deng (University of Liverpool), Lei You (Technical University of Denmark), Bo Zhao (Aalto University)
Vista is a verifier-in-the-loop RL system for quantum program synthesis that efficiently schedules staged verification calls—compilers, simulators, optimizers—throughout training, so that agents learn from correctness signals rather than text plausibility. It demonstrates that formally verified program synthesis is tractable as a learning problem when verification cost is treated as a first-class system constraint.
Robust Agent Compensation (RAC): Teaching AI Agents to Compensate
Srinath Perera (WSO2, Santa Clara, CA, USA), Kaviru Hapuarachchi (WSO2, Santa Clara, CA, USA), Frank Leymann (University of Stuttgart, Stuttgart, Germany), Rania Khalaf (WSO2, Santa Clara, CA, USA)
Robust Agent Compensation (RAC) is an architectural extension that adds automatic rollback and recovery to existing agent frameworks, providing a log-based safety net for unintended side effects without requiring agents to be rewritten. It's compatible with most major frameworks via existing extension points and validated on τ²-bench and REALM-Bench.
Dossier: Deep Research via Ledger-Driven Branching Search and Query Encoding Learning
Om Chabra (MIT), Noah Ziems (University of Notre Dame), Meng Jiang (University of Notre Dame), Omar Khattab (MIT), Hari Balakrishnan (MIT)
Dossier is a deep research agent that replaces linear search trajectories with parallel branching search, tracking claims, contradictions, and information gaps in a persistent Research Ledger. It consistently outperforms ReAct-style agents on multi-hop research tasks by preventing early retrieval failures from compounding through the reasoning chain.
Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation
Jackson Hassell (Megagon Labs), Dan Zhang (Megagon Labs), Hannah Kim (Megagon Labs), Tom Mitchell (Megagon Labs), Estevam Hruschka (Megagon Labs)
A memory-augmented agent framework that enables LLM agents to learn new classification functions from labeled examples at inference time, without any parameter updates. It uses LLM-generated episodic critiques of specific past mistakes and distills them into reusable semantic task-level guidance, outperforming few-shot prompting and matching fine-tuned baselines on diverse tasks.
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
Igor Bogdanov (Carleton University), Chung-Horng Lung (Carleton University), Thomas Kunz (Carleton University), Jie Gao (Carleton University), Adrian Taylor (Defence R&D Canada), Marzia Zaman (Cistel Technology)
FORGE is a population-based protocol in which LLM agents self-generate and evolve natural-language memory—textual heuristics and few-shot demonstrations—through reflection and competitive selection across episodes. It improves agent decision-making over time with no gradient updates, using only the same base LLM that the agent runs on.
Decomposing Sycophancy, Fragility, Consensus Collapse and Cost in Homogeneous Multi-Agent LLM Debate
Blaž Bertalanič (Jožef Stefan Institute), Carolina Fortuna (Jožef Stefan Institute)
A controlled empirical study showing that homogeneous multi-agent debate amplifies rather than corrects LLM errors through sycophancy and consensus collapse, often performing worse than a single isolated model on hard benchmarks. The results undercut a widely-held assumption that peer review among agents filters hallucinations.
How To Steer Your Multi-Agent System: Human-LLM Collaborative Planning
Zeyu He (Penn State University), Hannah Kim (Megagon Labs), Dan Zhang (Megagon Labs), Estevam Hruschka (Megagon Labs)
A user study and prototype showing that humans can effectively supervise multi-agent plans at the process level—inspecting, steering, and refining intermediate reasoning—rather than only verifying final outputs. The work characterizes hybrid human-AI planning patterns and identifies the effort-control-risk trade-offs that determine when process-level supervision is worth the cost.
Scideator: Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation
Marissa Radensky (University of Washington), Simra Shahid (Microsoft), Raymond Fok (University of Washington), Pao Siangliulue (Allen Institute for AI), Tom Hope (Allen Institute for AI), Daniel S. Weld (Allen Institute for AI)
Scideator is a human-LLM compound system for scientific ideation that extracts purposes, mechanisms, and evaluations from researcher-supplied papers, then lets users interactively recombine these facets to generate and evaluate novel research ideas. It operationalizes facet-based ideation with both an LLM-driven novelty evaluator and a user study confirming idea quality.
Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling
Naoki Otani (Megagon Labs), Nikita Bhutani (Megagon Labs), Hannah Kim (Megagon Labs), Dan Zhang (Megagon Labs), Estevam Hruschka (Megagon Labs)
An empirical study overturning the assumption that single-step 'think-then-act' planning is the right default for agentic AI: for data-centric tool-calling tasks, full-horizon planning—generating a complete plan before any execution—consistently yields higher accuracy. The finding challenges a foundational design choice in most current agent frameworks.
Open Agent Specification: A Unified Representation for AI Agents
Soufiane Amini (Oracle), Yassine Benajiba (Oracle), Cesare Bernardis (Oracle), Paul Cayet (Oracle), Hassan Chafi (Oracle), Abderrahim Fathan (Oracle), Louis Faucon (Oracle), Damien Hilloulin (Oracle, Zurich, Switzerland), Sungpack Hong (Oracle), Ingo Kossyk (Oracle), Tirthankar Lahiri (Oracle), Tran Minh Son Le (Oracle), Rhicheek Patra (Oracle), Sujith Ravi (Oracle), Jonas Schweizer (Oracle), Jyotika Singh (Oracle), Shailender Singh (Oracle), Weiyi Sun (Oracle), Kartik Talamadupula (Oracle), Jerry Xu (Oracle)
Open Agent Specification is a framework-agnostic declarative language for defining AI agents and multi-agent workflows, enabling portability and interoperability across agent frameworks. It provides common abstractions for control flow, data semantics, and tool integration so that workflows built in one framework can run in another.
AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting
Jibang Wu (The University of Chicago), Chenghao Yang (The University of Chicago), Yi Wu (The University of Chicago), Simon Mahns (The University of Chicago), Chaoqi Wang (The University of Chicago), Hao Zhu (Stanford University), Fei Fang (Carnegie Mellon University), Haifeng Xu (The University of Chicago)
AI Realtor is an agentic copywriting framework for real estate marketing that grounds persuasive language generation in factual property attributes, predicted marketable features, and user preferences. It demonstrates that LLM-based content generation can be simultaneously persuasive, personalized, and factually anchored—a combination prior automated copywriting systems have not achieved.
MARVIS: Modality Adaptive Reasoning over VISualizations
Benjamin Feuer (Stanford University), Lennart Purucker (Prior Labs), Oussama Elachqar (Oumi), Chinmay Hegde (New York University)
MARVIS converts latent embeddings from small specialized ML models into visual representations, then uses a VLM's spatial reasoning to make predictions on non-traditional modalities and long-tail domains. It achieves competitive accuracy without requiring raw data exposure or retraining the underlying specialized models.
Equitable Ranking in Heterogeneous Marketplace Ecosystems: A Foundation Model Framework for Quality-Aware Fairness
Saurabh Krishna Kansara
FairRank-LLM is a marketplace ranking framework, deployed in a live marketplace ecosystem, that uses foundation model semantic representations to simultaneously address fairness, cold-start disadvantage, and quality—eliminating the rich-get-richer dynamics that hurt new or under-resourced providers in interaction-history-dependent ranking systems.
Evaluation & Benchmarking (12 papers)
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Anna Mazhar (Cornell University), Huzaifa Suri (UIUC), Sainyam Galhotra (Cornell University)
A study showing how uncertainty in heterogeneous input artifacts—PDFs, spreadsheets, slide decks—propagates and amplifies through multi-agent workflows, producing qualitatively different execution trajectories under controlled perturbations. The results show that outcome-only evaluation of agentic systems systematically misses contamination-induced failures that are only visible at the trace level.
ViBench: A Benchmark on Vibe Coding
Peter Zhong (Replit & Carnegie Mellon University), Pashootan Vaezipoor (Georgian AI Lab), Fuyang Cui (Georgian AI Lab), Vaibhav Kumar (Replit), James Austin (Replit), Azin Asgarian (Georgian AI Lab), Toby Ho (Replit), Paul Inder (Georgian AI Lab), Imen Kedir (Replit), Zhen Li (Replit), Nicholas Ondo (Replit), Asna Shafiq (Georgian AI Lab), Ibrahim Sheikh (Replit), Edouard Sioufi (Replit), Setareh Soltanieh (Georgian AI Lab), Ben Wilde (Georgian AI Lab), Jacky Zhao (Replit), Ryan Carelli (Replit), Heather Miller (Carnegie Mellon University), Michele Catasta (Replit)
ViBench is the first open-source benchmark for evaluating AI agents on realistic end-to-end vibe coding—natural-language-to-working-web-application creation—derived from production traces across 15 applications. It measures the complete development process rather than isolated code completion, revealing where the current generation of agents breaks down in practice.
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Hung Tran (Vals AI), Langston Nashold (Vals AI), Rayan Krishnan (Vals AI), Antoine Bigeard (Vals AI), Alex Gu (MIT)
Vibe Code Bench is a benchmark of 100 web application specifications with over 10,000 browser-evaluated substeps showing that even the best frontier models complete only 58% of realistic end-to-end application development tasks. The benchmark uses an autonomous browser agent to verify deployed applications against behavioral specifications, measuring what users actually care about rather than code syntax.
Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook
Ming Li (University of Maryland), Xirui Li (University of Maryland), Tianyi Zhou (Mohamed bin Zayed University of Artificial Intelligence)
An empirical study of Moltbook, an AI agent social network, showing that artificial agent societies exhibit human-like socialization dynamics—semantic stabilization, influence persistence, and collective consensus formation—at scale. The findings raise fundamental questions about how social norms and coordination emerge in AI-populated environments.
Reasoning-Intensive Regression
Diane Tchuindjo (Massachusetts Institute of Technology), Omar Khattab (Massachusetts Institute of Technology)
This paper establishes a new problem class—reasoning-intensive regression—where LLMs deduce subtle numerical scores from text, covering rubric grading, dense reward modeling, and domain-specific retrieval scoring. LLMs prove surprisingly effective at this with limited task-specific training data, opening a practical path to automated evaluation in settings where labeled data is scarce.
Willful Disobedience: Automatically Detecting Failures in Agentic Traces
Reshabh K Sharma (University of Washington), Shraddha Barke (Microsoft Research), Benjamin Zorn (Microsoft Research)
AgentPex automatically detects procedural failures in agentic execution traces—wrong workflow routing, unsafe tool use, violations of prompt-specified rules—by extracting behavioral rules from agent prompts and checking entire execution histories against them. It catches critical failures that outcome-only benchmarks miss, making it practical to validate agent behavior at the workflow level.
Why Johnny Can’t Use Agents: Industry Aspirations vs. User Realities with AI Agents
Pradyumna Shome (Carnegie Mellon University), Sashreek Krishnan (Carnegie Mellon University), Sauvik Das (Carnegie Mellon University)
A mixed-methods study of 102 commercial AI agents and 31 end-user participants that quantifies the gap between what the industry markets AI agents as being able to do and what real users can actually accomplish. The findings surface recurring usability failure patterns and call into question whether current agent UX is ready for the mainstream use cases being advertised.
OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
Skyler Hallinan (University of Southern California), Thejas Venkatesh (Samaya AI), Xiang Ren (University of Southern California), Sai Praneeth Karimireddy (University of Southern California), Ashwin Paranjape (Samaya AI), Yuhao Zhang (Samaya AI), Jack Hessel (Samaya AI)
OpaqueToolsBench evaluates whether LLM agents can learn to use poorly-documented tools through interaction and self-generated documentation improvement, across three environments: general function calling, chess, and long-horizon agentic tasks. Most current agents show limited ability to adapt to opaque tools, exposing a practical gap between benchmark performance and real-world tool-use reliability.
Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with AgentFuel
Aadyaa Maddi (Carnegie Mellon University), Prakhar Naval (Rockfish), Deepti Mande (Rockfish), Muckai Girish (Rockfish), Shane Duan (Rockfish), Vyas Sekar (CMU/Rockfish)
AgentFuel is a framework for generating expressive, domain-specific evaluations for conversational data analysis agents, exposing systematic gaps in how popular open-source and proprietary agents handle domain-relevant timeseries queries. It gives practitioners a principled way to evaluate and compare agents on their own data and query patterns rather than generic benchmarks.
DraftNEPABench: A Benchmark for Drafting NEPA Document Sections with Coding Agents
Anurag Acharya (Pacific Northwest National Laboratory), Bishal Lakha (Pacific Northwest National Laboratory), Rounak Meyur (Pacific Northwest National Laboratory), Rohan Nuttall (OpenAI), Sarthak Chaturvedi (Pacific Northwest National Laboratory), Anika Halappanavar (Pacific Northwest National Laboratory), Leah Hare (Pacific Northwest National Laboratory), Lin Zeng (Pacific Northwest National Laboratory), Mike Parker (Pacific Northwest National Laboratory), Sai Munikoti (Pacific Northwest National Laboratory), Sameera Horawalavithana (Pacific Northwest National Laboratory)
DraftNEPABench is a benchmark that challenges AI coding agents to draft sections of Environmental Impact Statements, extending the agentic evaluation frontier beyond software engineering into high-stakes structured document creation. Results reveal where frontier agents still fall short when domain knowledge, regulatory structure, and long-form coherence are all required.
Benchmarking Agents in Insurance Underwriting Environments
Amanda Dsouza (Snorkel AI), Ramya Ramakrishnan (Snorkel AI), Charles Dickens (Snorkel AI), Bhavishya Pohani (Snorkel AI), Christopher M Glaze (Snorkel AI)
UNDERWRITE is an expert-first benchmark for evaluating AI agents in insurance underwriting, built in close collaboration with domain practitioners to capture enterprise-realistic complexity: proprietary business knowledge, noisy tool interfaces, and imperfect data. It fills the gap left by open-domain benchmarks that overemphasize code and narrow accuracy metrics.
Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models
Nimet Beyza Bozdag (University of Illinois Urbana-Champaign), Shuhaib Mehri (University of Illinois Urbana-Champaign), Gokhan Tur (University of Illinois Urbana-Champaign), Dilek Hakkani-Tur (University of Illinois Urbana-Champaign)
Persuade Me if You Can is a framework for systematically evaluating both LLM persuasion effectiveness and susceptibility to persuasion in multi-agent interactions. It shows that susceptibility to persuasion is a distinct and critical alignment property—not captured by existing safety evaluations—and varies substantially across models and persuasion strategies.
Security & Privacy (11 papers)
Does Safety Molt? Evaluating LLM Safety in Multi-Agent Social Environments
Aman Priyanshu (Foundation-AI), Supriti Vijay (Foundation-AI), Esha Pahwa (Corvic AI)
A study showing that LLM safety degrades substantially in persistent multi-agent social environments compared to single-turn evaluation: privacy violations nearly double when shifting from isolated to social multi-turn settings, and leakage is socially contagious—spreading across agent communities through interaction. The findings reveal a fundamental gap in how current safety evaluations assess deployed agents.
A HIPAA-Compliant Architecture for Agentic Clinical AI Systems
Himanshu Tripathi (The University of Alabama), Subash Neupane (Meharry Medical College), Sudip Mittal (The University of Alabama), Shahram Rahimi (University of Alabama), Vibhuti Gupta (University of Texas Medical Branch)
A framework for HIPAA-compliant agentic clinical AI that enforces PHI governance through attribute-based access control, a hybrid regex-and-BERT redaction pipeline applied at both pre- and post-inference stages, and immutable audit trails. It addresses compliance vulnerabilities that existing LLM frameworks leave unresolved when agents autonomously handle protected health information.
Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain
Léo Boisvert (ServiceNow Research, Mila - Quebec AI Institute, Polytechnique Montréal), Abhay Puri (ServiceNow Research), Chandra Kiran Reddy Evuru (ServiceNow), Nazanin Mohammadi Sepahvand (ServiceNow Research), Nicolas Chapados (Mila - Quebec AI Institute, Polytechnique Montréal), Quentin Cappart (Polytechnique Montréal), Alexandre Lacoste (ServiceNow Research), Krishnamurthy Dvijotham (ServiceNow Research), Alexandre Drouin (ServiceNow Research)
Malice in Agentland demonstrates that AI agent supply chains are vulnerable to backdoor attacks at three distinct layers: finetuning data poisoning, pre-backdoored base models, and a novel environment poisoning vector that exploits the agent's interaction with its deployment environment. The attacks are hard to detect and cause agents to behave maliciously only when specific triggers are present.
SAPO: Secure Automated Prompt Optimization via Multi-Agent Collaboration
Emmanuel Aboah Boateng, Zachary Johnson (Microsoft), Tian Xia (Microsoft), Sarah Zhang, Aidan Jay (Microsoft), Junyao Feng (Microsoft), Aditya Mate (Microsoft), Ehi Nosakhare (Microsoft)
SAPO is a multi-agent prompt optimization framework that treats safety as a first-class constraint, jointly maximizing task performance and robustness to adversarial inputs. It closes the gap left by accuracy-only prompt optimization, which routinely produces prompts vulnerable to jailbreaks and harmful outputs.
The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents
Tanmay Sah (Harrisburg University of Science and Technology), Vishal Srivastava (Johns Hopkins University), Dolly Sah (University of Utah), Kayden Jordan (Harrisburg University of Science and Technology)
An empirical study quantifying the 'Verifier Tax'—the persistent reduction in task success rate caused by adding runtime safety enforcement to tool-using LLM agents. The results reveal a model-dependent Safety-Capability Gap, with crossover horizons of 15–30 interaction turns beyond which safety enforcement dominates, giving practitioners concrete guidance on where safety and capability trade-offs bite.
MoltGraph: A Longitudinal Temporal Graph Dataset of Moltbook for Coordinated-Agent Detection
Kunal Mukherjee (Virginia Polytechnic Institute and State University (Virginia Tech)), Cuneyt Akcora (University of Central Florida), Murat Kantarcioglu (Virginia Polytechnic Institute and State University)
MoltGraph is a longitudinal temporal graph dataset of Moltbook agent interactions with ground-truth coordination labels, capturing heterogeneous interactions, temporal drift, and visibility signals needed to study influence manipulation on AI-native social platforms. It provides the first graph-native dataset for developing and evaluating coordinated inauthentic behavior detection in agent societies.
Securing Agents With Tracked Capabilities
Martin Odersky (EPFL), Yaoyu Zhao (EPFL), Yichen Xu (EPFL), Oliver Bračevac (EPFL), Cao Nguyen Pham (EPFL)
A type-system-based safety harness for AI agents that uses Scala 3's capture checking to statically track which resources and effects an agent can access, preventing prompt injection, data leakage, and unintended side effects at the programming-language level rather than through runtime heuristics.
Exploring and Developing a Pre-Model Safeguard with Draft Models
Hongyu Cai (Purdue University), Arjun Arunasalam (Florida International University), Yiming Liang (Purdue University), Antonio Bianchi (Purdue University), Z. Berkay Celik (Purdue University)
A pre-model jailbreak guard that invokes a draft model to generate a partial response before the target model sees the prompt, enabling safety auditing of both the input and the anticipated output. This dual-signal approach catches attacks that evade prompt-only guards by embedding harmful intent across multiple benign-looking turns.
Who Decides the Trade-off? Resolution Policy as Delegation Governance in Autonomous Agents
Koji YAMAZAKI (Docomo Innovations, Inc)
An empirical study showing that when AI agents face conflicting constraints, today's systems resolve the conflict probabilistically through model sampling—producing outcomes that are unpredictable, irreproducible, and unauditable—and introducing Resolution Policy (the Deterministic Delegation Model) as a formal governance mechanism. DDM makes constraint trade-offs explicit and structurally binding, reducing deviation from 76% to 0% in experiments across two frontier LLMs.
Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use
Francisco Javier Arceo (Red Hat), Varsha Prasad Narsing (Red Hat)
An open-source, vendor-neutral architecture (Llama Stack) for enterprise RAG and agentic systems that enforces multi-tenant isolation, policy-aware access control, and regulatory compliance at the retrieval layer. It addresses a fundamental flaw in standard RAG: relevance-based retrieval can surface one tenant's confidential data to another tenant simply because it scores highest.
When Harmful Intent Dissolves into Technical Detail: How Safe Are Coding Agents Against Cyber Misuse?
Xiangzhe Xu (Purdue University), Shiwei Feng (Purdue University), Guangyu Shen (Purdue University), Xiangyu Zhang (Purdue University)
A safety study of AI coding agents showing that harmful cyber intent embedded in technically plausible step-by-step prompts frequently bypasses agent safety measures, because the agent can execute each individual step without recognizing the malicious downstream consequence of the sequence. The paper introduces a benchmark for this threat category and evaluates frontier coding agents against it.
System Optimization & Efficiency 10 papers
Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
Jelena Markovic-Voronov (LinkedIn), Kayhan Behdin (LinkedIn), Yuanda Xu (LinkedIn), Zhengze Zhou (LinkedIn), Zhipeng Wang (LinkedIn), Rahul Mazumder (LinkedIn, MIT)
A batch-level LLM routing framework that jointly assigns models to an entire incoming request batch while respecting cost, GPU, and concurrency constraints—rather than routing each query independently. A robust variant explicitly accounts for uncertainty in predicted model quality, preventing adversarial or skewed batches from defeating cost controls.
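The batch-level formulation can be illustrated with a toy greedy assignment under a cost budget and per-model capacity limits; this is a simplified stand-in for the paper's joint optimizer, and all quality and cost numbers are invented:

```python
# Illustrative sketch: assign each query in a batch to a model,
# preferring higher predicted quality while respecting a total cost
# budget and per-model capacity (a greedy stand-in for joint routing).
def route_batch(quality, cost, capacity, budget):
    """quality[q][m]: predicted quality of model m on query q;
    cost[m]: per-query cost; capacity[m]: max queries per model."""
    used = {m: 0 for m in cost}
    spent = 0.0
    assignment = {}
    for q in range(len(quality)):
        # Try feasible models in order of predicted quality.
        for m in sorted(cost, key=lambda m: -quality[q][m]):
            if used[m] < capacity[m] and spent + cost[m] <= budget:
                assignment[q] = m
                used[m] += 1
                spent += cost[m]
                break
    return assignment

quality = [{"small": 0.6, "large": 0.9},
           {"small": 0.8, "large": 0.85}]
cost = {"small": 1.0, "large": 5.0}
assignment = route_batch(quality, cost,
                         capacity={"small": 2, "large": 1}, budget=6.0)
```

Routing the batch jointly is what lets the capacity and budget checks bind across queries—per-query routing would have no way to reserve the large model for the query that benefits most.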
Scaling Textual Gradients via Sampling-Based Momentum
Zixin Ding (University of Chicago), Junyuan Hong (University of Texas at Austin), Zhan Shi (Santa Clara University), Tianhao Wang (Princeton University), Zinan Lin (Microsoft Research), Li Yin (SylphAI), Meng Liu (SylphAI), Atlas Wang (UT Austin), Yuxin Chen (University of Chicago)
A method for scaling prompt optimization with LLM-generated textual gradients that introduces sampling-based momentum to overcome context-length limits and instability at large training set sizes. It shows that principled scaling of textual gradient descent—analogous to SGD with momentum—yields consistent gains that naive scaling cannot achieve.
FedMECA: Scalable Federated Learning via Memory-Efficient and Concurrent Aggregation
Zhonghao Chen (University of Florida), Duo Zhang (University of California, Merced), Xiaoyi Lu (University of Florida)
FedMECA is a federated learning aggregation system that decouples model collection from aggregation to overcome scalability failures as client counts or model sizes grow. By enabling concurrent, memory-efficient aggregation, it makes federated training viable at scales where existing systems stall.
Constant-Memory Retrieval via Koopman Operator Estimation for Mamba-3
Alexander Johansen (Stanford University), Anupama Sridhar
Spectral Koopman Attention (SKA) is a module for state-space models that eliminates the 'memory cliff'—where retrieval accuracy collapses for long sequences—while maintaining constant memory usage with no KV cache. It fits a spectral linear system to key-value history in closed form, enabling reliable long-range fact retrieval for extended agentic traces on commodity hardware.
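The core idea of fitting a linear transition operator to history in closed form can be shown on a toy linear system; this setup is illustrative and is not the paper's SKA module:

```python
# Minimal sketch: given a trajectory of states X_t, estimate the linear
# operator A minimizing ||X_{t+1} - A X_t||^2 in closed form via the
# pseudoinverse (A_hat = Y X^+), the least-squares/Koopman-style fit.
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])

# Roll out a short trajectory under the true linear dynamics.
states = [rng.standard_normal(2)]
for _ in range(50):
    states.append(A_true @ states[-1])

X = np.stack(states[:-1], axis=1)  # shape (2, T): states at time t
Y = np.stack(states[1:], axis=1)   # shape (2, T): states at time t+1

# Closed-form least-squares estimate of the transition operator.
A_hat = Y @ np.linalg.pinv(X)
```

Because the fit is a single closed-form solve over the accumulated history, memory stays constant regardless of sequence length—no per-token cache needs to be retained.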
FLASC: Federated LoRA with Sparse Communication
Kevin Kuo (Carnegie Mellon University), Arian Raje (Carnegie Mellon University), Kousik Rajesh (Carnegie Mellon University), Virginia Smith (Carnegie Mellon University)
FLASC is a federated LoRA fine-tuning method that combines low-rank adaptation with sparse top-K gradient communication, reducing inter-device communication overhead in cross-device federated learning. Unlike prior sparse LoRA approaches, it avoids the accuracy degradation and counterproductive communication cost increases that come from imposing sparsity alone.
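The sparse-communication half of the idea can be sketched as top-K magnitude selection on an adapter update; the shapes and K below are illustrative, not the FLASC implementation:

```python
# Sketch: transmit only the K largest-magnitude entries of a LoRA
# weight delta, zeroing the rest, as in sparse federated communication.
import numpy as np

def topk_sparsify(delta, k):
    """Zero all but the k largest-magnitude entries of a weight delta."""
    flat = delta.ravel()
    if k >= flat.size:
        return delta.copy()
    # Indices of the k largest |values|; everything else is dropped.
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    return sparse.reshape(delta.shape)

lora_A = np.array([[0.5, -0.01],
                   [2.0,  0.10]])  # toy LoRA factor update
sparse_A = topk_sparsify(lora_A, k=2)
```

Only the kept indices and values need to cross the network, which is where the communication savings come from; the paper's contribution is making this composition with low-rank adaptation work without the accuracy loss sparsity alone incurs.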
XGrammar++: Dynamic and Efficient Structured Generation Engine for Agentic LLMs
Linzhang Li (Shanghai Jiao Tong University), Yixin Dong (Carnegie Mellon University), Guanjie Wang (Shanghai Jiao Tong University), Ziyi Xu (Shanghai Jiao Tong University), Alexander Jiang (Carnegie Mellon University), Tianqi Chen (Carnegie Mellon University)
XGrammar++ is a structured generation engine for agentic LLM workloads that efficiently handles dynamic, variable tool-calling schemas by supporting tag-triggered structure switching and fine-grained cross-request reuse. It substantially reduces latency and overhead for production systems that mix diverse tool-call schemas within and across requests.
CAMI: Cost-Aware Agent-Guided Multi-Indexing for Semantic Retrieval
Adnan Qidwai (IBM Research - India), Anand Eswaran (IBM Research - India), Sonam Mishra (IBM Research - India), Jaydeep Sen (IBM Research - India), Sachindra Joshi (IBM Research - India)
CAMI is a cost-aware retrieval system that uses an agent to intelligently select which semantic enrichments—hypothetical queries, summaries, paraphrases—to generate per document chunk at index time, optimizing retrieval quality within a practical cost budget. It avoids the combinatorial explosion of exhaustively generating all enrichment types across large corpora.
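The budgeted selection problem can be illustrated with a classic benefit-per-cost greedy heuristic; the enrichment names match the abstract, but the benefit and cost numbers are invented and the agent's actual policy is surely richer:

```python
# Toy sketch: for a chunk, greedily pick enrichments with the best
# estimated benefit-per-cost ratio until the budget is exhausted,
# instead of generating every enrichment type for every chunk.
def select_enrichments(options, budget):
    """options: list of (name, benefit, cost); returns chosen names."""
    # Greedy by benefit/cost ratio, a standard knapsack heuristic.
    ranked = sorted(options, key=lambda o: o[1] / o[2], reverse=True)
    chosen, spent = [], 0.0
    for name, benefit, cost in ranked:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

options = [("hypothetical_queries", 0.9, 3.0),
           ("summary",              0.6, 1.0),
           ("paraphrase",           0.3, 2.0)]
chosen = select_enrichments(options, budget=4.0)
```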
SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs
Jiacheng Yang (University of Toronto), Jun Wu (Amazon), Yaoyao Ding (NVIDIA/University of Toronto), Zhiying Xu (Amazon Web Services), Yida Wang (Amazon), Gennady Pekhimenko (NVIDIA/University of Toronto)
SwiftFusion is a sequence parallelism system for distributed diffusion transformer inference that reduces latency by optimizing inter-GPU communication patterns and fusing all-to-all operations with attention computation. It enables high-resolution image and long video generation to scale efficiently across multi-GPU and multi-node configurations.
AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
Dzung Pham (University of Massachusetts Amherst), Kleomenis Katevas (Brave Software), Ali Shahin Shamsabadi (Brave Software), Hamed Haddadi (Brave Software, Imperial College London)
AgentStop reduces energy consumption of locally deployed LLM agents on consumer devices by predicting task completion likelihood mid-execution and terminating unproductive branches early. The system saves substantial energy with minimal impact on task success rates, making local agentic deployment practical on battery-powered hardware.
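The early-termination loop can be sketched with a stub completion predictor; the predictor, threshold, and step representation below are all hypothetical placeholders for the system's learned estimator:

```python
# Hypothetical sketch: after each agent step, estimate the probability
# the task will still succeed; if it drops below a threshold, stop the
# run early rather than burn energy on an unproductive branch.
def run_with_early_stop(steps, predict_success, threshold=0.2):
    """Execute steps, stopping once predicted success drops too low."""
    executed = []
    for step in steps:
        executed.append(step())
        if predict_success(executed) < threshold:
            return executed, "terminated_early"
    return executed, "completed"

# Stub predictor: confidence decays as failed steps accumulate.
def predict_success(history):
    failures = sum(1 for ok in history if not ok)
    return max(0.0, 1.0 - 0.5 * failures)

steps = [lambda: True, lambda: False, lambda: False, lambda: True]
executed, status = run_with_early_stop(steps, predict_success)
```

The energy saving comes from the steps that never run; the design question the paper addresses is making the predictor accurate enough that aborted runs rarely would have succeeded.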
Dissecting and Improving Communication Performance in Multi-Node LLM Inference
Prajwal Singhania (University of Maryland), Siddharth Singh (University of Maryland), Lannie Dalton Hough (University of Maryland), Akarsh Srivastava (University of Maryland), Harshitha Menon (Lawrence Livermore National Laboratory), Charles Fredrick Jekel (Lawrence Livermore National Laboratory), Abhinav Bhatele (University of Maryland)
A detailed performance study of multi-node distributed LLM inference on GPU clusters that characterizes communication bottlenecks across model-parallel strategies—tensor, pipeline, and sequence parallelism—at scale. The results identify the dominant sources of inter-node communication overhead and provide optimization strategies validated on state-of-the-art inference engines.
Engineering & Operations 3 papers
SEAR: Schema-Based Evaluation and Routing for LLM Gateways
Zecheng Zhang (Strukto.AI), Han Zheng (Infron.AI), Yue Xu (Infron.AI)
SEAR is a production evaluation and routing system for multi-model LLM gateways that exposes ~100 typed, SQL-queryable quality signals—covering intent, response characteristics, issue attribution, and scores—to drive fine-grained routing decisions across providers. It makes LLM gateway behavior observable and steerable in a way that provider-level metrics alone cannot.
Supervisory Control Theory for LLM Revision
Carlos Toxtli (Clemson University), Wangfan Li (Clemson University)
PLSA is a structured prompting framework that applies Supervisory Control Theory—a cognitive model of human oversight of automated systems—to guide iterative LLM self-revision. In a large-scale evaluation on ML conference paper revision tasks, SCT-structured prompts produce revisions with significantly higher fidelity than matched standard self-refinement baselines.
Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study
Srikanta Prasad Sondekoppam Vijayashankar (Salesforce India Pvt Ltd), Utkarsh Arora (Salesforce India Pvt Ltd)
A production deployment study from Salesforce describing the modular, platform-agnostic inference architecture behind Agentforce and ApexGuru, sharing hard-won lessons on serving concurrent, heterogeneous compound AI workloads at enterprise scale. The paper provides concrete architectural patterns for cost-effective, low-latency multi-model serving that academic treatments typically omit.