Paper Detail

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo

Hugging Face score: 11.5

Published 2026-04-15 · First seen 2026-04-17

General AI

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, the low-level part introduces a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing each component to be improved independently. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly in long-horizon skill composition and in fine-grained manipulation of small objects in cluttered scenes.
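The data flow described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the names `SubtaskPlan`, `crop_to_bbox`, `cross_attend`, and `cascaded_cross_attention` are hypothetical, the attention has no learned projections, and the residual update is an assumption. Only the fusion order (global context, then object-centric crop, then skill semantics) is taken from the abstract.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SubtaskPlan:
    """Hypothetical container for one planner output (field names are assumptions)."""
    instruction: str                  # subtask instruction in natural language
    bbox: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


def crop_to_bbox(image, bbox):
    """Return the object-centric crop of a row-major image (a list of rows)."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = bbox
    # Clamp the box to the image so an out-of-range prediction cannot fail.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [row[x0:x1] for row in image[y0:y1]]


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention without learned weights."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        attended.append([
            sum(w * v[d] for w, v in zip(weights, context)) / total
            for d in range(len(q))
        ])
    return attended


def cascaded_cross_attention(action_tokens, global_ctx, crop_ctx, skill_ctx):
    """Fuse the three context streams one after another with residual updates."""
    tokens = action_tokens
    for context in (global_ctx, crop_ctx, skill_ctx):
        update = cross_attend(tokens, context)
        tokens = [
            [t + u for t, u in zip(tok, upd)]
            for tok, upd in zip(tokens, update)
        ]
    return tokens


# Usage: a toy 4x6 "image" whose pixel value encodes (row, column).
image = [[r * 10 + c for c in range(6)] for r in range(4)]
plan = SubtaskPlan("pick up the red block", (2, 1, 5, 3))
crop = crop_to_bbox(image, plan.bbox)
```

In this toy setup the crop keeps only the pixels inside the planned bounding box, and the action tokens are updated by each context stream in turn. A real system would feed encoded image and text features through learned projections rather than raw lists.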

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

BibTeX

@misc{yang2026hivla,
  title = {HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System},
  author = {Tianshuo Yang and Guanyu Chen and Yutian Chen and Zhixuan Liang and Yitian Liu and Zanxin Chen and Chunpu Xu and Haotian Liang and Jiangmiao Pang and Yao Mu and Ping Luo},
  year = {2026},
  abstract = {While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In high-level part, a VLM planner first performs tas},
  url = {https://huggingface.co/papers/2604.14125},
  keywords = {Vision-Language-Action models, Vision-Language Models, diffusion models, Diffusion Transformer, cross-attention mechanism, cascaded cross-attention, task decomposition, visual grounding, structured plans, bounding box, motor control, semantic planning, zero-shot reasoning, skill composition, fine-grained manipulation, cluttered scenes, huggingface daily},
  eprint = {2604.14125},
  archiveprefix = {arXiv},
}
