V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

Bingda Tang, Yuhui Zhang, Xiaohan Wang, Jiayuan Mao, Ludwig Schmidt, Serena Yeung-Levy

huggingface Score 8.0

Published 2026-04-25 · First seen 2026-04-30

General AI

Abstract

Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a 2× speedup over MixGRPO and a 3× speedup over DiffusionNFT.
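
To make the combination described in the abstract more concrete, the following is a minimal sketch of how group-relative advantages from GRPO could be paired with an ELBO-based likelihood-ratio surrogate. It is not the authors' implementation. The function names, the clipping range, and the choice to approximate the log-likelihood ratio by a difference of denoising losses evaluated on shared noise and timesteps are all assumptions made for illustration; the numeric inputs are placeholder values, not results from the paper.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def elbo_log_ratio(loss_new, loss_old):
    """Hypothetical ELBO-based stand-in for log p_new(x) - log p_old(x).

    Assumes both losses are denoising losses computed on the same noise and
    timesteps, so that (up to constants) -loss plays the role of the ELBO.
    """
    return -(loss_new - loss_old)


def clipped_objective(log_ratios, advantages, clip=0.2):
    """PPO-style clipped surrogate averaged over the group (to be maximized)."""
    terms = []
    for log_ratio, adv in zip(log_ratios, advantages):
        ratio = math.exp(log_ratio)
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        terms.append(min(ratio * adv, clipped * adv))
    return mean(terms)


# Placeholder rewards for four samples generated from one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

# Placeholder denoising losses under the current and the old model.
losses_new = [0.10, 0.08, 0.11, 0.10]
losses_old = [0.10, 0.10, 0.10, 0.10]
log_ratios = [elbo_log_ratio(n, o) for n, o in zip(losses_new, losses_old)]

objective = clipped_objective(log_ratios, advantages)
```

Reusing the same noise and timesteps for both losses is one plausible way to lower the variance of the ratio estimate, and the clipping limits how far a single update can move the model. Whether these correspond to the specific techniques used in V-GRPO is not stated in the abstract.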

BibTeX

@misc{tang2026v,
  title = {V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think},
  author = {Bingda Tang and Yuhui Zhang and Xiaohan Wang and Jiayuan Mao and Ludwig Schmidt and Serena Yeung-Levy},
  year = {2026},
  abstract = {Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower },
  url = {https://huggingface.co/papers/2604.23380},
  keywords = {denoising generative models, policy-gradient, reinforcement learning, Markov decision process, diffusion evidence lower bound, variational inference, Group Relative Policy Optimization, text-to-image synthesis, surrogate variance, gradient steps, code available, huggingface daily},
  eprint = {2604.23380},
  archiveprefix = {arXiv},
}
