Paper Detail

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li

Hugging Face score: 16.5

Published 2026-06-08 · First seen 2026-06-09

General AI

Abstract

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.
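
As a rough illustration of how a relative token reduction such as the 28.57% figure quoted above can be computed, here is a minimal sketch. The function name `token_reduction_pct` and the token counts are hypothetical, chosen only so the arithmetic is easy to follow; they are not values from the paper. The abstract does not define how the 1.96x token-efficiency figure is obtained, so that metric is not reproduced here.

```python
def token_reduction_pct(text_tokens: int, optical_tokens: int) -> float:
    """Percentage of reasoning tokens saved relative to a text-reasoning baseline.

    Both arguments are hypothetical token counts for the same problem:
    one for a text rationale, one for an image-based rationale.
    """
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    return 100.0 * (text_tokens - optical_tokens) / text_tokens


# Hypothetical counts (not from the paper): 700 text tokens vs. 500 image tokens.
print(round(token_reduction_pct(700, 500), 2))  # 28.57
```

A negative result would indicate that the image-based rationale used more tokens than the text baseline.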

BibTeX

@misc{bian2026optical,
  title = {Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text},
  author = {Yutong Bian and Dongjie Cheng and Heming Xia and Yongqi Li and Wenjie Li},
  year = {2026},
  abstract = {Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this},
  url = {https://huggingface.co/papers/2606.09585},
  keywords = {Chain-of-Thought, Large Language Models, Multimodal Large Language Models, interleaved-modal reasoning, optical reasoning, typographic-based optical reasoning, graphical-based optical reasoning, visual rationales, token efficiency, code available, huggingface daily},
  eprint = {2606.09585},
  archiveprefix = {arXiv},
}
