SAPO: Secure Automated Prompt Optimization via Multi-Agent Collaboration
Emmanuel Aboah Boateng, Zachary Johnson (Microsoft), Tian Xia (Microsoft), Sarah Zhang, Aidan Jay (Microsoft), Junyao Feng (Microsoft), Aditya Mate (Microsoft), Ehi Nosakhare (Microsoft)
Security & Privacy; Architectural Patterns & Composition
Abstract
Prompt optimization is essential for deploying language models in specialized tasks, yet existing automated prompt optimization methods focus almost exclusively on task performance while treating safety as an afterthought. This gap is consequential: prompts optimized purely for accuracy can become susceptible to adversarial inputs that elicit harmful, biased, or confidential outputs. We introduce SAPO (Secure Automated Prompt Optimization), a multi-agent framework that formulates prompt optimization as a constrained multi-objective problem, maximizing task performance subject to explicit security constraints. SAPO coordinates four specialized agents through a central orchestrator: a Prompt Generation Agent for candidate creation, a Security Check Agent for adversarial robustness evaluation, a Performance Evaluation Agent for task accuracy measurement, and a Critic Agent that synthesizes cross-agent feedback to adaptively rebalance optimization weights across iterations. A security constraint favors candidate prompts that exceed minimum thresholds on both security and overall score during selection. The framework extends naturally to model migration scenarios, where prompts must transfer across model families without sacrificing safety or performance. Experiments across six tasks from the Instruction Induction and BIG-Bench Hard benchmarks, evaluated against HarmBench for adversarial robustness, demonstrate that SAPO attains perfect security scores while also achieving the highest aggregated task-accuracy score, exceeding single-objective baseline methods by at least 2.6%.