Paper Detail

Visual Reasoning through Tool-supervised Reinforcement Learning

Qihua Dong, Gozde Sahin, Pei Wang, Zhaowei Cai, Robik Shrestha, Hao Yang, Davide Modolo

huggingface Score 15.5

Published 2026-04-21 · First seen 2026-04-23

General AI

Abstract

In this paper, we investigate how Multimodal Large Language Models can effectively master tool use to solve complex visual reasoning tasks. To that end, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, which uses direct tool supervision for more effective tool-use learning. We focus on a set of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, for which tool supervision is easy to collect. We develop a two-stage reinforcement learning curriculum: the first stage is optimized solely with a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while tool calls remain allowed. In this way, the model masters tool calling before it uses tools to complete visual reasoning tasks, which avoids potential optimization conflicts among these heterogeneous tasks. Our experiments show that the tool-supervised curriculum training is efficient and that ToolsRL achieves strong tool-use capabilities on complex visual reasoning tasks.
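
The two-stage curriculum described in the abstract can be pictured with a small reward-switching sketch. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the reward formulas, so the IoU-based zoom-in reward, the exact-match accuracy reward, and the names `Rollout`, `box_iou`, and `curriculum_reward` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Rollout:
    """Hypothetical summary of one model rollout."""
    zoom_box: Optional[Box]  # region chosen by a zoom-in call, if any
    answer: str              # final textual answer


def curriculum_reward(stage: int, rollout: Rollout,
                      target_box: Box, target_answer: str) -> float:
    """Stage 1: tool-specific reward only. Stage 2: accuracy reward only."""
    if stage == 1:
        # Assumed tool reward: overlap between the zoomed region and a
        # supervised target region; answer correctness is ignored here.
        if rollout.zoom_box is None:
            return 0.0
        return box_iou(rollout.zoom_box, target_box)
    if stage == 2:
        # Assumed accuracy reward: case-insensitive exact match. Tool calls
        # are still allowed in this stage but are not rewarded directly.
        same = rollout.answer.strip().lower() == target_answer.strip().lower()
        return 1.0 if same else 0.0
    raise ValueError(f"unknown curriculum stage: {stage}")
```

Keeping the two reward signals in separate stages, rather than summing them, mirrors the abstract's stated motivation of avoiding optimization conflict between learning to call tools and learning to answer correctly.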

BibTeX

@misc{dong2026visual,
  title = {Visual Reasoning through Tool-supervised Reinforcement Learning},
  author = {Qihua Dong and Gozde Sahin and Pei Wang and Zhaowei Cai and Robik Shrestha and Hao Yang and Davide Modolo},
  year = {2026},
  abstract = {In this paper, we investigate the problem of how to effectively master tool-use to solve complex visual reasoning tasks for Multimodal Large Language Models. To achieve that, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, with direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, whose tool supervision is easy to collect. A reinforc},
  url = {https://huggingface.co/papers/2604.19945},
  keywords = {Tool-supervised Reinforcement Learning, multimodal large language models, visual reasoning tasks, tool-use learning, reinforcement learning curriculum, tool-specific rewards, accuracy targeted rewards, tool calling capability, huggingface daily},
  eprint = {2604.19945},
  archiveprefix = {arXiv},
}