Paper Detail

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

Chang-Bin Zhang, Yujie Zhong, Qiang Zhang, Kai Han

huggingface Score 19.0

Published 2026-05-29 · First seen 2026-06-01

General AI

Abstract

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.
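
The abstract does not say how the consistency reward is computed, so the following is only a hedged sketch of what aligning a textual stream with a visually grounded stream could look like, not the authors' method. It assumes (hypothetically) that a box can be recovered from the textual stream, for example by a separate probing query, and scores its agreement with a box from the grounded stream using intersection-over-union. The names `iou`, `textual_stream_reward`, `probed_box`, `grounded_box`, the `(x1, y1, x2, y2)` box format, and the `weight` value are all illustrative assumptions.

```python
from typing import Sequence

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def textual_stream_reward(
    answer_correct: bool,
    probed_box: Box,
    grounded_box: Box,
    weight: float = 0.5,
) -> float:
    """Hypothetical reward for a textual-stream rollout.

    Combines answer accuracy with a consistency term that measures how well
    the box recovered from the textual stream matches the grounded stream.
    The additive form and the default weight are assumptions.
    """
    return float(answer_correct) + weight * iou(probed_box, grounded_box)
```

Under these assumptions, a textual rollout that answers correctly and localizes the same region as the grounded stream receives the highest reward, while a correct answer with no overlap receives only the accuracy term.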

BibTeX

@misc{zhang2026ivgr,
  title = {iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning},
  author = {Chang-Bin Zhang and Yujie Zhong and Qiang Zhang and Kai Han},
  year = {2026},
  url = {https://huggingface.co/papers/2605.31096},
  keywords = {Chain-of-Thought, multimodal large language models, visually grounded reasoning, reinforcement learning, dual-stream training, consistency reward, fine-grained perception, code available, huggingface daily},
  eprint = {2605.31096},
  archiveprefix = {arXiv},
}
