

Composing Policy Gradients and Prompt Optimization for Language Model Programs

Noah Ziems (University of Notre Dame), Dilara Soylu (Stanford University), Lakshya A Agrawal (UC Berkeley), Isaac Miller (Anyscale), Liheng Lai (UC Berkeley), Chen Qian (CMU), Kaiqiang Song (Zoom, Inc.), Meng Jiang (University of Notre Dame), Dan Klein (UC Berkeley), Matei Zaharia (UC Berkeley), Karel D’Oosterlinck (Contextual AI), Christopher Potts (Stanford University), Omar Khattab (MIT)

Architectural Patterns & Composition

A generalization of GRPO to modular multi-prompt LLM programs that enables RL post-training across agent systems with multiple LM calls, variable-length trajectories, and interrupted rollouts. The paper shows for the first time that RL training and automatic prompt optimization compose well together, jointly improving accuracy by 11% on average.

Presentation

Talk

Paper Session 6: Learning & Control

Thursday, May 28 · 3:40 PM – 3:50 PM

Bayshore Ballroom

Poster

Thursday, May 28 · 4:30 PM – 6:00 PM

Carmel

Abstract

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how practitioners can best leverage online RL algorithms like GRPO to improve these systems. We begin to address this challenge by investigating whether it is possible to effectively instantiate GRPO for arbitrary multi-prompt programs and whether it can work robustly as an off-the-shelf optimizer for LM programs using the same abstractions and constraints typically involved for prompt optimization. Our main variant of multi-module GRPO constructs groups from module-level invocations, and we also consider trajectory-level grouping as another natural instantiation. We find for the first time that GRPO (and its multi-module counterpart) empirically composes well with automatic prompt optimization, and together they improve accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM—with 5% gains against prompt optimization on its own. We open-source mmGRPO in the DSPy library at dspy.ai.
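As a rough illustration of the module-level grouping mentioned in the abstract, the sketch below groups LM calls by the module that produced them and normalizes rewards within each group, in the usual GRPO style. It is a minimal sketch under assumptions, not the mmGRPO implementation in DSPy: the function name `module_level_advantages`, the rollout data layout, and the choice to credit every call with its trajectory's final reward are all hypothetical choices made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def module_level_advantages(rollouts, eps=1e-6):
    """Compute group-relative advantages with one group per module.

    `rollouts` is a list of (reward, module_names) pairs sampled for ONE
    input example, where `module_names` lists the modules invoked in that
    rollout, in order. Rollouts may have different lengths.
    (Hypothetical layout, chosen for illustration only.)
    """
    # Collect every invocation of a module, across all rollouts, into one group.
    groups = defaultdict(list)
    for rollout_idx, (reward, calls) in enumerate(rollouts):
        for call_idx, module in enumerate(calls):
            # Assumption: each call inherits its trajectory's final reward.
            groups[module].append((rollout_idx, call_idx, reward))

    # Normalize rewards within each module-level group.
    advantages = {}
    for members in groups.values():
        rewards = [r for _, _, r in members]
        mu = mean(rewards)
        sigma = pstdev(rewards)
        for rollout_idx, call_idx, r in members:
            advantages[(rollout_idx, call_idx)] = (r - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    # Two rollouts of a hypothetical search-then-answer program.
    example = [
        (1.0, ["search", "search", "answer"]),
        (0.0, ["search", "answer"]),
    ]
    for key, adv in sorted(module_level_advantages(example).items()):
        print(key, round(adv, 3))
```

Because groups are keyed by module rather than by position in the trajectory, rollouts that call a module a different number of times still contribute to the same group. Trajectory-level grouping, the other instantiation the abstract mentions, would instead treat each whole rollout as one group member.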

ACM CAIS 2026 Sponsors