Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Yi Wang, Xinchen Li, Pengwei Xie, Pu Yang, Buqing Nie, Yunuo Cai, Qinglin Zhang, Chendi Qu, Jeffrey Wu, Jianheng Song, Xinlin Ren, Jingshun Huang, Mingjie Pan, Siyuan Feng, Zhi Chen, Jianlan Luo

huggingface Score 3.5

Published 2026-05-01 · First seen 2026-05-04

General AI

Abstract

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3–5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.

BibTeX

@misc{wang2026learning,
  title = {Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies},
  author = {Yi Wang and Xinchen Li and Pengwei Xie and Pu Yang and Buqing Nie and Yunuo Cai and Qinglin Zhang and Chendi Qu and Jeffrey Wu and Jianheng Song and Xinlin Ren and Jingshun Huang and Mingjie Pan and Siyuan Feng and Zhi Chen and Jianlan Luo},
  year = {2026},
  abstract = {Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (V},
  url = {https://huggingface.co/papers/2605.00416},
  keywords = {Vision-Language-Action, reinforcement learning, policy improvement, autonomous rollouts, human interventions, Distributional Implicit Value Learning, Q-learning, Adjoint Matching, flow-based action generators, huggingface daily},
  eprint = {2605.00416},
  archiveprefix = {arXiv},
}
