Paper Detail

Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Yixu Huang, Tinghui Zhu, Muhao Chen

huggingface Score 8.5

Published 2026-04-16 · First seen 2026-04-20

General AI

Abstract

Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for any tasks. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50--90\% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
soon
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@misc{huang2026learning,
  title = {Learning Adaptive Reasoning Paths for Efficient Visual Reasoning},
  author = {Yixu Huang and Tinghui Zhu and Muhao Chen},
  year = {2026},
  abstract = {Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for any tasks. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasonin},
  url = {https://huggingface.co/papers/2604.14568},
  keywords = {visual reasoning models, cross-modal reasoning, reasoning path redundancy, adaptive visual reasoning, visual perception, logical reasoning, answer application, FS-GRPO, Group Relative Policy Optimization, vision-language benchmarks, token usage, code available, huggingface daily},
  eprint = {2604.14568},
  archiveprefix = {arXiv},
}

Metadata

{}