Paper Detail

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Haolei Xu, Haiwen Hong, Hongxing Li, Rui Zhou, Yang Zhang, Longtao Huang, Hui Xue, Yongliang Shen, Weiming Lu, Yueting Zhuang

arxiv Score 13.8

Published 2026-04-09 · First seen 2026-04-10

General AI

Abstract

Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{xu2026seeing,
  title = {Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts},
  author = {Haolei Xu and Haiwen Hong and Hongxing Li and Rui Zhou and Yang Zhang and Longtao Huang and Hui Xue and Yongliang Shen and Weiming Lu and Yueting Zhuang},
  year = {2026},
  abstract = {Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We the},
  url = {https://arxiv.org/abs/2604.08541},
  keywords = {cs.CV, cs.AI, cs.CL, Computer science, Distraction, Domain (mathematical analysis), Artificial intelligence, Routing (electronic design automation)},
  eprint = {2604.08541},
  archiveprefix = {arXiv},
}

Metadata

{}