Paper Detail

StarVLA-$α$: Reducing Complexity in Vision-Language-Action Systems

Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng, Zixuan Wang, Yuxin Chen, Pengguang Chen, Yilun Chen, Shu Liu, Jiaya Jia

arxiv Score 11.3

Published 2026-04-13 · First seen 2026-04-14

General AI

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$α$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$α$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $π_{0.5}$ by 20% on the public real-world RoboChallenge benchmark. We expect StarVLA-$α$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

BibTeX

@article{ye2026starvla,
  title = {StarVLA-$\alpha$: Reducing Complexity in Vision-Language-Action Systems},
  author = {Jinhui Ye and Ning Gao and Senqiao Yang and Jinliang Zheng and Zixuan Wang and Yuxin Chen and Pengguang Chen and Yilun Chen and Shu Liu and Jiaya Jia},
  year = {2026},
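  abstract = {Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$\alpha$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$\alpha$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $\pi_{0.5}$ by 20\% on the public real-world RoboChallenge benchmark. We expect StarVLA-$\alpha$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.},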
  url = {https://arxiv.org/abs/2604.11757},
  keywords = {cs.RO, cs.AI, cs.CV},
  eprint = {2604.11757},
  archiveprefix = {arXiv},
}
