Paper Detail

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang, Jie Wang, Yinfeng Gao, Xizhou Bu, Haochen Tian, Yihang Qiu, Feiyang Jia, Lin Liu, Yigu Ge, Hanbing Li, Yuannan Shen, Jianwei Cui, Hongwei Xie, Bing Wang, Haiyang Sun, Jingwei Zhao, Jiahui Huang, Pei Liu, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Hanchao Leng, Kun Ma, Naiyang Wang, Guang Chen, Kuiyuan Yang, Hangjun Ye, Long Chen

arXiv · Score 17.3

Published 2026-04-20 · First seen 2026-04-21

General AI

Abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided in both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL
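The latency argument in the abstract can be illustrated with a small back-of-the-envelope sketch. This is not the authors' code: the `DecodeCost` class, the function names, and every number below are hypothetical, and the model assumes that prefilling a handful of extra latent tokens does not noticeably change the cost of the single parallel prefill pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeCost:
    """Hypothetical per-pass costs in milliseconds (illustrative, not measured)."""
    prefill_ms: float  # one parallel pass over the prompt and any prefilled tokens
    step_ms: float     # one sequential autoregressive decoding step


def explicit_cot_latency(cost: DecodeCost, n_cot: int, n_answer: int) -> float:
    # Explicit CoT: reasoning tokens and answer tokens are all decoded one by one.
    return cost.prefill_ms + cost.step_ms * (n_cot + n_answer)


def answer_only_latency(cost: DecodeCost, n_answer: int) -> float:
    # No reasoning at all: only the answer tokens are decoded sequentially.
    return cost.prefill_ms + cost.step_ms * n_answer


def latent_prefill_latency(cost: DecodeCost, n_answer: int) -> float:
    # Latent tokens ride along in the parallel prefill pass (assumed to add
    # no sequential steps), so only the answer tokens are decoded one by one.
    return cost.prefill_ms + cost.step_ms * n_answer


if __name__ == "__main__":
    cost = DecodeCost(prefill_ms=50.0, step_ms=20.0)  # made-up numbers
    print("explicit CoT:", explicit_cot_latency(cost, n_cot=100, n_answer=10))
    print("answer only :", answer_only_latency(cost, n_answer=10))
    print("latent      :", latent_prefill_latency(cost, n_answer=10))
```

Under these assumptions the latent variant costs exactly as much as answer-only prediction, while explicit CoT grows linearly with the number of reasoning tokens. Real prefill cost does depend on sequence length, so the equality is only approximate in practice.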

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

BibTeX

@article{lu2026onevl,
  title = {OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation},
  author = {Jinghui Lu and Jiayi Guan and Zhijian Huang and Jinlong Li and Guang Li and Lingdong Kong and Yingyan Li and Han Wang and Shaoqing Xu and Yuechen Luo and Fang Li and Chenxu Dang and Junli Wang and Tao Xu and Jing Wu and Jianhua Wu and Xiaoshuai Hao and Wen Zhang and Tianyi Jiang and Lingfeng Zhang and Lei Zhou and Yingbo Tang and Jie Wang and Yinfeng Gao and Xizhou Bu and Haochen Tian and Yihang Qiu and Feiyang Jia and Lin Liu and Yigu Ge and Hanbing Li and Yuannan Shen and Jianwei Cui and Hongwei Xie and Bing Wang and Haiyang Sun and Jingwei Zhao and Jiahui Huang and Pei Liu and Zeyu Zhu and Yuncheng Jiang and Zibin Guo and Chuhong Gong and Hanchao Leng and Kun Ma and Naiyang Wang and Guang Chen and Kuiyuan Yang and Hangjun Ye and Long Chen},
  year = {2026},
  abstract = {Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving.},
  url = {https://arxiv.org/abs/2604.18486},
  keywords = {cs.CV, cs.CL, cs.RO},
  eprint = {2604.18486},
  archiveprefix = {arXiv},
}
