FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Igor Bogdanov (Carleton University), Chung-Horng Lung (Carleton University), Thomas Kunz (Carleton University), Jie Gao (Carleton University), Adrian Taylor (Defence R&D Canada), Marzia Zaman (Cistel Technology)

Architectural Patterns & Composition

FORGE is a population-based protocol in which LLM agents self-generate and evolve natural-language memory—textual heuristics and few-shot demonstrations—through reflection and competitive selection across episodes. It improves agent decision-making over time with no gradient updates, using only the same base LLM that the agent runs on.

Presentation

Talk

Paper Session 4: Agent Memory & Planning

Thursday, May 28 · 9:40 AM – 9:50 AM

Bayshore Ballroom

Poster

Thursday, May 28 · 4:30 PM – 6:00 PM

Carmel

Abstract

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE combines two loops. In a Reflexion-style inner loop, a dedicated reflection agent (using the same underlying LLM, with no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (RULES), few-shot demonstrations (EXAMPLES), or both (MIXED). An outer loop propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B_line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared with a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, and reduces major-failure rates (return below -100) to as low as ~1%. We find that (1) population broadcast is the critical mechanism: a no-graduation ablation confirms that broadcast carries the performance gains, while graduation primarily saves compute; (2) EXAMPLES achieves the strongest returns for three of four models, while RULES offers the best cost-reliability profile with ~40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting that FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B_line; cross-family findings are directional rather than conclusive.
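The two-loop structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Instance`, `run_stage`, `broadcast`, and `forge` are hypothetical, as are the callables `run_episode` and `reflect`, which stand in for the agent rollout and the reflection agent. The failure test (return below zero) and the graduation criterion (mean stage return at or above a threshold) are assumptions made for illustration, since the abstract does not specify either.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Instance:
    """One population member; its memory is injected into the agent prompt."""
    memory: List[str] = field(default_factory=list)
    graduated: bool = False
    last_score: float = float("-inf")


def run_stage(inst: Instance,
              run_episode: Callable[[List[str]], Tuple[float, str]],
              reflect: Callable[[str, List[str]], str],
              episodes: int) -> float:
    """Inner loop (Reflexion-style): reflect on failed episodes only."""
    returns = []
    for _ in range(episodes):
        ret, trajectory = run_episode(inst.memory)
        returns.append(ret)
        if ret < 0:  # assumed failure test, not taken from the paper
            inst.memory.append(reflect(trajectory, inst.memory))
    return sum(returns) / len(returns)


def broadcast(population: List[Instance]) -> Instance:
    """Outer loop step: copy the best instance's memory to non-graduated peers."""
    best = max(population, key=lambda inst: inst.last_score)
    for inst in population:
        if inst is not best and not inst.graduated:
            inst.memory = list(best.memory)  # copy so later edits stay independent
    return best


def forge(population: List[Instance],
          run_episode: Callable[[List[str]], Tuple[float, str]],
          reflect: Callable[[str, List[str]], str],
          stages: int,
          episodes: int,
          graduation_return: float) -> Instance:
    """Run staged evolution; graduated instances are frozen and skipped."""
    for _ in range(stages):
        if all(inst.graduated for inst in population):
            break
        for inst in population:
            if inst.graduated:
                continue
            inst.last_score = run_stage(inst, run_episode, reflect, episodes)
            if inst.last_score >= graduation_return:  # assumed criterion
                inst.graduated = True
        broadcast(population)
    return max(population, key=lambda inst: inst.last_score)
```

In this sketch, skipping graduated instances is what saves compute, while `broadcast` is the only channel through which one instance's learned memory reaches the others, which mirrors the roles the abstract assigns to graduation and population broadcast.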

ACM CAIS 2026 Sponsors