Paper Detail

STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

Yuxuan Tian, Yurun Jin, Bin Yu, Yukun Shi, Hao Wu, Chi Harold Liu, Kai Chen, Cong Huang

Browse

Workflow Queues

arxiv Score 4.3

Published 2026-04-29 · First seen 2026-04-30

General AI

Open paper source

Abstract

Robotic manipulation critically requires reasoning about future spatial-temporal interactions, yet existing VLA policies and world-model-enhanced policies do not fully model action-relevant spatial-temporal interaction structure. We propose STARRY, a world-model-enhanced action-generation policy that aligns spatial-temporal prediction with action generation. STARRY jointly denoises future spatial-temporal latents and action sequences, and introduces Geometry-Aware Selective Attention Modulation to convert predicted depth and end-effector geometry into token-aligned weights for selective action-attention modulation. On RoboTwin 2.0, STARRY achieves 93.82% / 93.30% average success under Clean and Randomized settings. Real-world experiments further improve average success from 42.5% to 70.8% over $π_{0.5}$, demonstrating the effectiveness of action-centric spatial-temporal world modeling for spatial-temporally demanding robotic action generation.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: soon
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{tian2026starry,
  title = {STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation},
  author = {Yuxuan Tian and Yurun Jin and Bin Yu and Yukun Shi and Hao Wu and Chi Harold Liu and Kai Chen and Cong Huang},
  year = {2026},
  abstract = {Robotic manipulation critically requires reasoning about future spatial-temporal interactions, yet existing VLA policies and world-model-enhanced policies do not fully model action-relevant spatial-temporal interaction structure. We propose STARRY, a world-model-enhanced action-generation policy that aligns spatial-temporal prediction with action generation. STARRY jointly denoises future spatial-temporal latents and action sequences, and introduces Geometry-Aware Selective Attention Modulation },
  url = {https://arxiv.org/abs/2604.26848},
  keywords = {cs.RO},
  eprint = {2604.26848},
  archiveprefix = {arXiv},
}

Metadata

{}