Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

Junhao Shi, Siyin Wang, Xiaopeng Yu, Li Ji, Jingjing Gong, Xipeng Qiu

huggingface Score 6.8

Published 2026-07-02 · First seen 2026-07-03

General AI

Abstract

Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap, unlabeled interaction data -- including discarded off-task trajectories and autonomous robot play -- via a self-supervised Inverse Dynamics objective. A lightweight second stage then grounds these priors in language using minimal expert data. On the SIMPLER benchmark, TAP matches models trained on over 1M expert trajectories while using orders of magnitude less labeled data, yielding a 10% absolute gain over standard behavior cloning. On a real-world WidowX platform, TAP retains 25% success under camera perturbations where internet-scale baselines collapse to 0%, demonstrating that task-agnostic pretraining produces robust, transferable physical representations and offers a scalable path forward for Embodied AI.
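
The abstract names a self-supervised Inverse Dynamics objective but gives no formulation. As a rough illustration only, not the authors' implementation, the sketch below fits a linear inverse dynamics model that predicts the action a_t from a pair of consecutive observations (o_t, o_{t+1}), using no language labels. The toy linear dynamics, the dimensions, and the function names `make_play_data`, `fit_inverse_dynamics`, and `inverse_dynamics_loss` are all assumptions made for this example.

```python
import numpy as np


def make_play_data(n_steps, obs_dim, act_dim, seed=0):
    """Generate toy 'robot play' transitions (hypothetical stand-in for unlabeled data).

    Assumed dynamics: next_obs = obs + actions @ B, with a random matrix B.
    """
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(act_dim, obs_dim))
    obs = rng.normal(size=(n_steps, obs_dim))
    actions = rng.normal(size=(n_steps, act_dim))
    next_obs = obs + actions @ B
    return obs, actions, next_obs


def fit_inverse_dynamics(obs, next_obs, actions):
    """Fit a linear map from [o_t, o_{t+1}] to a_t by least squares."""
    X = np.concatenate([obs, next_obs], axis=1)
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return W


def inverse_dynamics_loss(W, obs, next_obs, actions):
    """Mean squared error between predicted and recorded actions."""
    X = np.concatenate([obs, next_obs], axis=1)
    pred = X @ W
    return float(np.mean((pred - actions) ** 2))


if __name__ == "__main__":
    obs, actions, next_obs = make_play_data(n_steps=200, obs_dim=6, act_dim=3)
    W = fit_inverse_dynamics(obs, next_obs, actions)
    print("inverse dynamics MSE:", inverse_dynamics_loss(W, obs, next_obs, actions))
```

The supervision signal here is the recorded action itself, so any logged interaction can serve as training data regardless of whether it completed a task. A real VLA would replace the linear map with a learned visual encoder and action head; that part is outside what the abstract specifies.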

BibTeX

@misc{shi2026learning,
  title = {Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs},
  author = {Junhao Shi and Siyin Wang and Xiaopeng Yu and Li Ji and Jingjing Gong and Xipeng Qiu},
  year = {2026},
  abstract = {Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnost},
  url = {https://huggingface.co/papers/2607.02466},
  keywords = {Vision-Language-Action models, expert demonstrations, physical competence, semantic alignment, self-supervised Inverse Dynamics, task-agnostic pretraining, behavior cloning, SIMPLER benchmark, WidowX platform, code available, huggingface daily},
  eprint = {2607.02466},
  archiveprefix = {arXiv},
}
