GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization
Journal: arXiv
Published Date: Mar 26, 2025
Abstract
Recent advances in large language models have highlighted the critical need
for precise control over model outputs through predefined constraints. While
existing methods attempt to achieve this through either direct
instruction-response synthesis or preferential response optimization, they
often struggle with constraint understanding and adaptation. This limitation
becomes particularly evident when handling fine-grained constraints, leading to
either hallucination or brittle performance. We introduce Generative
Adversarial Policy Optimization (GAPO), a novel framework that combines
GAN-based training dynamics with an encoder-only reward model to progressively
learn and adapt to increasingly complex constraints. GAPO leverages adversarial
training to automatically generate training samples of varying difficulty while
utilizing the encoder-only architecture to better capture prompt-response
relationships. Extensive experiments demonstrate GAPO's superior performance
across multiple benchmarks, particularly in scenarios requiring fine-grained
constraint handling, where it significantly outperforms existing methods such as
PPO, DPO, and KTO. Our results suggest that GAPO's unique approach to
preferential prompt learning offers a more robust and effective solution for
controlling LLM outputs. Code is available at
https://github.com/MikeGu721/GAPO.