Instead of using static reward models or critics, RLAC trains a dynamic critic alongside the generator (the RL policy) in an adversarial two-player game. This makes it possible to verify outputs on free-form generation tasks without enumerating all possible rubrics or manually engineering robust reward models.
Abstract
Open-ended generation tasks require outputs to satisfy diverse and often implicit
task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively
high verification costs and incomplete assessments of a response, making
reinforcement learning (RL) post-training with rubric-based rewards difficult to
scale. This problem is exacerbated by the fact that the best way to combine these rubrics into a single reward is often highly prompt-specific. We propose
Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach
that addresses these challenges via dynamic rubric verification. Our approach
employs a large language model (LLM) as a critic that dynamically identifies
only the most likely failure modes (e.g., a factual error or unhandled edge case),
which are then verified by an external validator; the generator and the critic are optimized jointly. This adversarial game sharpens the critic’s error detection and improves the generator’s output quality while reducing the number of required
verifications. Our experiments demonstrate that RLAC improves factual accuracy
in text generation and correctness in code generation, while also outperforming
exhaustive verification and reward model methods. We show that dynamic critics
are more effective than fixed critics, showcasing the potential of RLAC for scaling
RL post-training to free-form generation tasks.
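To make the adversarial game concrete, here is a minimal Python sketch of one RLAC-style update step. The objects generator, critic, and external_validator (and their sample/update methods) are hypothetical stand-ins, and the zero-one rewards with a single policy-style update per player are a simplification of the training dynamic described above, not the paper’s actual implementation.

    # Minimal sketch of one RLAC-style adversarial update (illustrative only).
    # Hypothetical interfaces:
    #   generator.sample(prompt)                    -> free-form response
    #   critic.sample(prompt, response)             -> the single most likely failure mode
    #   external_validator(prompt, response, claim) -> True if the claim is a genuine error
    def rlac_step(prompt, generator, critic, external_validator):
        # 1. The generator (RL policy) produces a free-form response.
        response = generator.sample(prompt)

        # 2. The critic proposes only the most likely failure mode, e.g. a fact it
        #    believes is false or an edge case it believes is unhandled, rather than
        #    enumerating every rubric.
        claim = critic.sample(prompt, response)

        # 3. A single external validation call checks that one claim.
        error_confirmed = external_validator(prompt, response, claim)

        # 4. Zero-sum rewards: the critic wins if it exposed a genuine error,
        #    the generator wins if the claimed error does not hold up.
        generator_reward = 0.0 if error_confirmed else 1.0
        critic_reward = 1.0 - generator_reward

        # 5. Both players take an RL update on their own reward, so the critic
        #    keeps tracking the generator's current failure modes.
        generator.update(prompt, response, generator_reward)
        critic.update(prompt, response, claim, critic_reward)
        return generator_reward, critic_reward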
Figure: Enumerative Method vs. Reward Model vs. RLAC (Ours).
Enumerative methods verify outputs by enumerating evaluation rubrics exhaustively or approximately; reward-model approaches replace explicit verification with a pretrained model that outputs scalar rewards; and RLAC dynamically identifies and validates likely errors.
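As a rough illustration of why per-response verification cost differs across these approaches, the sketch below counts validator calls under each scheme; rubrics, verify, reward_model, and critic are hypothetical stand-ins rather than interfaces from the paper.

    # Illustrative per-response reward computation under the three schemes
    # (hypothetical callables, simplified 0/1 rewards).
    def enumerative_reward(response, rubrics, verify):
        # Checks every rubric: len(rubrics) validator calls per response.
        return sum(verify(response, rubric) for rubric in rubrics) / len(rubrics)

    def reward_model_reward(response, reward_model):
        # No explicit verification: one learned scalar, cheap but prone to reward hacking.
        return reward_model(response)

    def rlac_reward(response, critic, verify):
        # One critic proposal plus one validator call: only the single most
        # likely failure mode is checked.
        claim = critic(response)
        return 0.0 if verify(response, claim) else 1.0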
Evaluation
Performance Comparison on Factual Text Generation
We perform RL training with two policy models, Qwen3-4B and Qwen3-8B.
RLAC achieves the highest FactScore across both model sizes and generation lengths.
These results demonstrate that RLAC scales more efficiently with increasing generation
complexity while preserving factual accuracy.
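FactScore is, roughly, the fraction of atomic facts in a response that are supported by a reference knowledge source; the sketch below shows that computation, with extract_atomic_facts and is_supported as hypothetical callables passed in by the caller (this is not the official FactScore implementation).

    # Sketch of a FactScore-style metric: the fraction of atomic facts that the
    # knowledge source supports. The two callables are hypothetical.
    def factscore(response, knowledge_source, extract_atomic_facts, is_supported):
        facts = extract_atomic_facts(response)  # decompose the response into atomic claims
        if not facts:
            return 0.0
        supported = sum(is_supported(fact, knowledge_source) for fact in facts)
        return supported / len(facts)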
Performance Comparison on Code Generation
We perform RL training with two policy models,
Qwen2.5-Coder-7B-Base and Qwen2.5-Coder-7B-Instruct.
RLAC achieves the highest average Pass@1 across HumanEval,
MBPP, BigCodeBench, and LiveCodeBench benchmarks,
outperforming both enumerative (AceCoder-Rule) and reward-model (AceCoder-RM) baselines.
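Pass@1 here means that a single sampled program passes all of a problem’s tests. For reference, the snippet below is the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021); for k = 1 it reduces to the fraction of correct samples.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator: probability that at least one of k samples,
        # drawn without replacement from n generations of which c are correct,
        # passes all tests (Chen et al., 2021).
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Pass@1 is simply the fraction of generated solutions that pass the tests.
    assert pass_at_k(n=10, c=5, k=1) == 0.5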
Adversarial Critic Matters
We compare static and adversarial training of the critic
to evaluate its role in guiding the generator.
As shown in Table 3, the adversarially trained critic increases the number of
correct facts (21.6 vs. 17.8) while keeping errors low.
In contrast, the static critic achieves a slightly higher FactScore by reducing the
number of generated facts, indicating over-precision rather than genuine improvement.
These results highlight that dynamic, adversarial critic training is crucial: it continuously adapts to the generator’s behavior, preventing reward hacking and sustaining meaningful supervision.
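A toy example of this over-precision effect (numbers chosen for illustration, not taken from Table 3): because FactScore divides correct facts by total facts, a response that simply states fewer facts can score higher while conveying less correct information.

    # Toy illustration with made-up numbers (not from Table 3).
    def precision(correct_facts, total_facts):
        return correct_facts / total_facts

    sparse = precision(18, 20)   # 0.900: fewer facts, higher FactScore-style score
    richer = precision(22, 26)   # ~0.846: more correct facts, lower score
    assert sparse > richer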
Reference
BibTeX
@misc{wu2025rlacreinforcementlearningadversarial,
      title={RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks},
      author={Mian Wu and Gavin Zhang and Sewon Min and Sergey Levine and Aviral Kumar},
      year={2025},
      eprint={2511.01758},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.01758},
}