Composing Policy Gradients and Prompt Optimization for Language Model Programs

Noah Ziems (University of Notre Dame), Dilara Soylu (Stanford University), Lakshya A Agrawal (UC Berkeley), Isaac Miller (Anyscale), Liheng Lai (UC Berkeley), Chen Qian (CMU), Kaiqiang Song (Zoom, Inc.), Meng Jiang (University of Notre Dame), Dan Klein (UC Berkeley), Matei Zaharia (UC Berkeley), Karel D’Oosterlinck (Contextual AI), Christopher Potts (Stanford University), Omar Khattab (MIT)

Architectural Patterns & Composition

A generalization of GRPO to modular multi-prompt LLM programs that enables RL post-training across agent systems with multiple LM calls, variable-length trajectories, and interrupted rollouts. The paper shows for the first time that RL training and automatic prompt optimization compose well, together improving accuracy by 11% on average against the post-trained LM.

Presentation

Talk

Paper Session 6: Learning & Control

Thursday, May 28 · 3:40 PM – 3:50 PM

Bayshore Ballroom

Poster

Thursday, May 28 · 4:30 PM – 6:00 PM

Carmel

Abstract

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how practitioners can best leverage online RL algorithms like GRPO to improve these systems. We begin to address this challenge by investigating whether it is possible to effectively instantiate GRPO for arbitrary multi-prompt programs and whether it can work robustly as an off-the-shelf optimizer for LM programs using the same abstractions and constraints typically involved in prompt optimization. Our main variant of multi-module GRPO constructs groups from module-level invocations, and we also consider trajectory-level grouping as another natural instantiation. We find for the first time that GRPO (and its multi-module counterpart) empirically composes well with automatic prompt optimization, and together they improve accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, with 5% gains against prompt optimization on its own. We open-source mmGRPO in the DSPy library at dspy.ai.
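
To make the idea of constructing groups from module-level invocations more concrete, here is a minimal sketch in plain Python. It is not the DSPy implementation and does not use any DSPy API. The function name `module_level_advantages`, the rollout format, and the module names are hypothetical, and grouping purely by module name (with every invocation inheriting its trajectory's final reward) is a simplifying assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def module_level_advantages(rollouts, eps=1e-6):
    """Compute group-relative advantages with one group per module.

    rollouts: list of (modules_called, reward) pairs sampled for the same
    input. modules_called is the ordered list of module names invoked in
    that rollout; reward is the scalar score of the whole trajectory.
    Returns {module_name: [(rollout_index, advantage), ...]}.
    """
    # Assumption: each invocation is credited with its trajectory's reward.
    groups = defaultdict(list)
    for idx, (modules, reward) in enumerate(rollouts):
        for module in modules:
            groups[module].append((idx, reward))

    advantages = {}
    for module, members in groups.items():
        rewards = [r for _, r in members]
        mu = mean(rewards)
        sigma = pstdev(rewards)
        # Standard GRPO-style normalization within the group.
        advantages[module] = [
            (idx, (r - mu) / (sigma + eps)) for idx, r in members
        ]
    return advantages


# Three rollouts of a hypothetical two-module search program.
rollouts = [
    (["generate_query", "summarize", "generate_query", "summarize"], 1.0),
    (["generate_query", "summarize"], 0.0),
    (["generate_query"], 0.0),  # interrupted before "summarize" ran
]
adv = module_level_advantages(rollouts)
```

Because groups are formed per module rather than per trajectory, rollouts of different lengths and rollouts that stop early still contribute the invocations they did produce. Trajectory-level grouping, the other instantiation mentioned above, would instead normalize one reward per rollout.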
