Paper Detail

Action Images: End-to-End Policy Learning via Multiview Video Generation

Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan

arXiv · Score 3.8

Published 2026-04-07 · First seen 2026-04-08

General AI

Abstract

World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
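The abstract says 7-DoF actions are turned into multi-view images grounded in 2D pixels, but it does not say how they are rendered. The sketch below is one plausible reading, not the authors' method: it projects only the end-effector position through a pinhole camera for each view and draws a Gaussian blob at the resulting pixel. The function names, the camera intrinsics and extrinsics, the `(x, y, z, roll, pitch, yaw, gripper)` action layout, and the choice to ignore rotation and gripper state are all assumptions made for illustration.

```python
import numpy as np

def project_point(point_world, K, T_world_to_cam):
    """Project a 3D world point to (u, v) pixel coordinates with a pinhole model."""
    p_h = np.append(point_world, 1.0)           # homogeneous coordinates
    p_cam = (T_world_to_cam @ p_h)[:3]          # point in the camera frame
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def render_action_image(uv, height, width, sigma=3.0):
    """Render a single-channel heatmap with a Gaussian blob centred on pixel uv."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - uv[0]) ** 2 + (ys - uv[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Hypothetical intrinsics shared by both views (focal length 100 px, 64x64 image).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])

# Hypothetical extrinsics: an identity view and one translated along x.
front = np.eye(4)
side = np.eye(4)
side[0, 3] = 0.1

# Assumed action layout: x, y, z, roll, pitch, yaw, gripper.
action = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
position = action[:3]                            # rotation and gripper are ignored here

views = {"front": front, "side": side}
action_images = {
    name: render_action_image(project_point(position, K, T), 64, 64)
    for name, T in views.items()
}
```

Each entry of `action_images` is a 64x64 array whose peak marks where the end-effector appears in that view. Stacking such frames over time would give one action video per camera; a fuller encoding would also need to represent orientation and gripper state.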

Workflow Status

Review status: pending
Role: unreviewed
Read priority: later
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.
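As a rough sketch of what such an enriched record might look like, the snippet below merges the four note fields into a record and serializes it as one JSON line. Only the four field names come from this page. The `eprint` key used to identify the paper, the value types (a list for `summary_sections`, strings for the rest), and the example values are assumptions, so check them against the actual `papers.jsonl` schema before using this.

```python
import json

# The four note fields named on this page; everything else is an assumption.
NOTE_FIELDS = {"summary_sections", "why_relevant", "claim_impact", "next_action"}

def enrich_record(record, notes):
    """Return a copy of record with the note fields merged in."""
    unknown = set(notes) - NOTE_FIELDS
    if unknown:
        raise KeyError(f"unsupported note fields: {sorted(unknown)}")
    return {**record, **notes}

# Hypothetical existing record; the real schema may use different keys.
record = {
    "eprint": "2604.06168",
    "title": "Action Images: End-to-End Policy Learning via Multiview Video Generation",
}

# Illustrative values only.
notes = {
    "summary_sections": ["Pixel-grounded action representation", "Zero-shot policy"],
    "why_relevant": "Uses the video backbone itself as the policy.",
    "claim_impact": "Reports zero-shot results on RLBench and real-world tasks.",
    "next_action": "Read the evaluation section.",
}

line = json.dumps(enrich_record(record, notes), ensure_ascii=False)
```

The resulting `line` would replace the paper's existing line in `papers.jsonl`. Writing it back is left out here because the file layout is not described on this page.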

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{zhen2026action,
  title = {Action Images: End-to-End Policy Learning via Multiview Video Generation},
  author = {Haoyu Zhen and Zixian Gao and Qiao Sun and Yilin Zhao and Yuncong Yang and Yilun Du and Tsun-Hsuan Wang and Yi-Ling Qiao and Chuang Gan},
  year = {2026},
  abstract = {World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model th},
  url = {https://arxiv.org/abs/2604.06168},
  keywords = {cs.CV, cs.RO, world action models, video backbones, future states, robot policy learning, action representations, pixel-grounded, multiview video generation, zero-shot policy, action images, 7-DoF robot actions, video-action joint generation, action-conditioned video generation, action labeling, huggingface daily},
  eprint = {2604.06168},
  archiveprefix = {arXiv},
}

Metadata

{}