Paper Detail

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu

huggingface Score 11.5

Published 2026-06-16 · First seen 2026-06-18

General AI

Abstract

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
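
The abstract names three RL reward terms (accuracy, format, and a discrete center-based detection reward used only on the DTR path) but does not give their definitions. The following is a minimal sketch of how such terms might be combined, not the authors' implementation. Everything specific here is an assumption made for illustration: the `<think>`/`<answer>` output template, exact-match answer scoring, reading "discrete" as a thresholded hit-or-miss per object, the 0.5 m threshold, the index-aligned pairing of predicted and ground-truth centers, and the weights.

```python
import math
import re
from typing import Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]

# Hypothetical output template; the abstract does not specify the real format.
_TEMPLATE = r"<think>.*?</think>\s*<answer>(.*?)</answer>"


def _parse_answer(response: str) -> Optional[str]:
    """Return the answer text if the response follows the assumed template."""
    match = re.fullmatch(_TEMPLATE, response.strip(), flags=re.DOTALL)
    return match.group(1).strip() if match else None


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if _parse_answer(response) is not None else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 on a case-insensitive exact match with the gold answer (assumed metric)."""
    answer = _parse_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer.lower() == gold.strip().lower() else 0.0


def center_detection_reward(
    pred_centers: Sequence[Point3D],
    gt_centers: Sequence[Point3D],
    threshold_m: float = 0.5,  # assumed threshold, in metres
) -> float:
    """Fraction of objects whose predicted 3D center lies within the threshold.

    Each object scores a discrete 0 or 1; centers are paired by index (assumed).
    """
    if not gt_centers or len(pred_centers) != len(gt_centers):
        return 0.0
    hits = sum(
        1 for pred, gt in zip(pred_centers, gt_centers)
        if math.dist(pred, gt) <= threshold_m
    )
    return hits / len(gt_centers)


def total_reward(
    response: str,
    gold: str,
    path: str,  # "LOR" or "DTR"
    pred_centers: Sequence[Point3D] = (),
    gt_centers: Sequence[Point3D] = (),
    w_acc: float = 1.0,  # weights are illustrative, not from the paper
    w_fmt: float = 0.5,
    w_det: float = 0.5,
) -> float:
    """Accuracy plus format reward; the detection term applies only to DTR."""
    reward = w_acc * accuracy_reward(response, gold) + w_fmt * format_reward(response)
    if path == "DTR":
        reward += w_det * center_detection_reward(pred_centers, gt_centers)
    return reward
```

In this sketch an LOR rollout is scored on answer correctness and format alone, while a DTR rollout additionally earns credit for each 3D center it localizes within the threshold, which mirrors the abstract's statement that the detection reward refines geometric alignment for DTR only.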

BibTeX

@misc{ji2026reinforcing,
  title = {Reinforcing Dual-Path Reasoning in Spatial Vision Language Models},
  author = {Yatai Ji and An-Chieh Cheng and Yang Fu and Yukang Chen and Han Zhang and Zhaojing Yang and Wei Huang and Ka Chun Cheung and Song Han and Vidya Nariyambut Murali and Pavlo Molchanov and Jan Kautz and Simon See and Hongxu Yin and Ping Luo and Sifei Liu},
  year = {2026},
  url = {https://huggingface.co/papers/2606.17539},
  keywords = {spatial VLMs, reinforcement learning, language-only reasoning, detect-then-reason, chain-of-thought supervision, region tokens, 3D geometric cues, discrete center-based detection, cold-start supervised fine-tuning, policy model, accuracy rewards, format rewards, geometric alignment, joint training, positive transfer, code available, huggingface daily},
  eprint = {2606.17539},
  archiveprefix = {arXiv},
}
