RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks
Mian Wu, Gavin Zhang, Sewon Min, Sergey Levine, Aviral Kumar
arXiv preprint, 2025
We propose a post-training methodology for improving language models on open-ended generation tasks. Instead of relying on static reward models, RLAC trains a dynamic LLM critic alongside the generator in an adversarial game. The critic identifies the most likely failure modes in each output, which are then validated externally, and it adapts as the generator improves. We demonstrate improvements in factual accuracy for text generation and correctness for code generation across multiple benchmarks.
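The loop described above can be illustrated with a minimal sketch: the generator produces an output, the critic proposes its most likely failure, an external verifier checks it, and the two rewards are set in opposition. All names here (generate, propose_failure, verify, train_step) are hypothetical placeholders, not the paper's actual implementation.

```python
import random

def generate(prompt: str) -> str:
    """Stand-in for the generator LLM producing a free-form response."""
    return f"response to: {prompt}"

def propose_failure(prompt: str, response: str) -> str:
    """Stand-in for the critic LLM naming the most likely failure mode,
    e.g. a claim in the response that is probably false."""
    return f"suspect claim in: {response}"

def verify(claim: str) -> bool:
    """Stand-in for external validation (fact checker, unit tests, etc.)."""
    return random.random() > 0.5

def train_step(prompt: str) -> None:
    response = generate(prompt)
    claim = propose_failure(prompt, response)
    claim_holds = verify(claim)
    # The generator is rewarded when the critic's proposed failure does not
    # hold up under external validation; the critic is rewarded when it
    # surfaces a genuine failure, so both players adapt over training.
    generator_reward = 1.0 if claim_holds else 0.0
    critic_reward = 1.0 - generator_reward
    print(f"generator reward={generator_reward}, critic reward={critic_reward}")

if __name__ == "__main__":
    train_step("Summarize the history of the Eiffel Tower.")
```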
Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
Qi Wang*, Mian Wu*, Yuyang Zhang*, Mingqi Yuan, Wenyao Zhang, Haoxiang You, Yunbo Wang, Xin Jin, Xiaokang Yang, Wenjun Zeng
Under review at CVPR 2026
We introduce a demonstration-free reward framework that leverages pretrained video diffusion models to generate goal-conditioned reward signals for RL agents, eliminating the need for hand-crafted reward design or expert demonstrations.
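As a rough illustration of this idea, the agent's recent frames can be scored by a pretrained video model conditioned on a goal description, and that score used directly as the RL reward. The function video_model_score below is a hypothetical placeholder for the diffusion model's scoring step, not the authors' implementation.

```python
import numpy as np

def video_model_score(frames: np.ndarray, goal: str) -> float:
    """Stand-in for a pretrained video diffusion model scoring how plausible
    the observed frames are as a video of an agent achieving `goal`."""
    return float(-np.mean((frames - 0.5) ** 2))  # dummy score for illustration

def goal_conditioned_reward(frames: np.ndarray, goal: str) -> float:
    """Reward signal derived from the video model's score, replacing a
    hand-crafted reward function or demonstration-based reward."""
    return video_model_score(frames, goal)

if __name__ == "__main__":
    frames = np.random.rand(8, 64, 64, 3)  # 8 RGB frames from the environment
    print(goal_conditioned_reward(frames, "open the drawer"))
```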