Paper Detail

MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias

Xingming Li, Ao Cheng, Qiyao Sun, Xixiang He, Xuanyu Ji, Runke Huang, Qingyong Hu

arxiv Score 12.3

Published 2026-06-16 · First seen 2026-06-17

General AI

Abstract

When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when images provide clear evidence otherwise. This bias poses risks for applications requiring visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models often get it right initially, forming correct vision-based predictions in their intermediate layers, before changing their minds and favoring text in the final output. We call this "late-layer textual override". The visual information is encoded; it simply does not survive to the output. More intriguingly, we find that how predictions change reveals whether they're correct: 85% of failures shift toward text, while 89% of successes shift toward vision. This directional signature enables a simple but powerful intervention: when we detect a confident visual prediction being suppressed, we restore it. We propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that recovers overridden predictions at inference time. Experiments across five MLLMs of varying architectures demonstrate up to 9.4% absolute improvements on conflict benchmarks while largely preserving standard performance, without training or external knowledge. CALRD recovers what the model already knew but failed to preserve.
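
The intervention described in the abstract (restore a confident intermediate prediction that the final layer abandons) can be illustrated with a toy sketch. This is not the authors' CALRD algorithm: the function name `restore_overridden`, the per-layer logit input, and the `0.8` confidence threshold are all assumptions made for illustration, and the sketch does not check whether the intermediate prediction is actually vision-based.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def restore_overridden(layer_logits, threshold=0.8):
    """Toy layer-reference decoding (hypothetical, not the paper's method).

    layer_logits: one logit vector per layer over the same candidate
    answers; the last entry is the final layer.
    Returns (chosen_index, was_restored).
    """
    final_probs = softmax(layer_logits[-1])
    final_choice = max(range(len(final_probs)), key=final_probs.__getitem__)

    # Find the most confident prediction among the intermediate layers.
    best_choice, best_conf = final_choice, 0.0
    for logits in layer_logits[:-1]:
        probs = softmax(logits)
        choice = max(range(len(probs)), key=probs.__getitem__)
        if probs[choice] > best_conf:
            best_choice, best_conf = choice, probs[choice]

    # Restore only if a confident intermediate answer was overridden.
    if best_choice != final_choice and best_conf >= threshold:
        return best_choice, True
    return final_choice, False
```

For example, calling `restore_overridden([[2.0, 0.0], [4.0, 0.0], [0.0, 1.0]])` restores answer 0, because an intermediate layer strongly preferred it before the final layer switched to answer 1. A real implementation would need a way to read per-layer predictions from the model and a principled choice of threshold, neither of which is specified in the abstract.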

BibTeX

@article{li2026mllms,
  title = {MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias},
  author = {Xingming Li and Ao Cheng and Qiyao Sun and Xixiang He and Xuanyu Ji and Runke Huang and Qingyong Hu},
  year = {2026},
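  abstract = {When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when images provide clear evidence otherwise. This bias poses risks for applications requiring visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models often get it right initially, forming correct vision-based predictions in their intermediate layers, before changing their minds and favoring text in the final output. We call this "late-layer textual override". The visual information is encoded; it simply does not survive to the output. More intriguingly, we find that how predictions change reveals whether they're correct: 85\% of failures shift toward text, while 89\% of successes shift toward vision. This directional signature enables a simple but powerful intervention: when we detect a confident visual prediction being suppressed, we restore it. We propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that recovers overridden predictions at inference time. Experiments across five MLLMs of varying architectures demonstrate up to 9.4\% absolute improvements on conflict benchmarks while largely preserving standard performance, without training or external knowledge. CALRD recovers what the model already knew but failed to preserve.},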
  url = {https://arxiv.org/abs/2606.17953},
  keywords = {cs.CV},
  eprint = {2606.17953},
  archiveprefix = {arXiv},
}
