Paper Detail

Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

Lixian Chen, Mingxuan Huang, Yanhui Chen, Junyi Lin, Yang Shi

arXiv Score 8.3

Published 2026-04-27 · First seen 2026-04-28

General AI

Abstract

Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
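
The failure mode described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' MG-MTTA implementation: it assumes a single scalar gate that mixes a visual and a textual posterior, a hypothetical reliability target `reliability` supplied by hand (the paper builds its prior from anchor-based modality consistency and cross-modal conflict, which is not modeled here), a hypothetical quadratic penalty weighted by `lam`, and a grid search in place of gradient updates. All function names and numbers are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def fuse(gate, p_visual, p_text):
    """Convex mixture of the two posteriors; gate is the weight on the visual branch."""
    return [gate * v + (1.0 - gate) * t for v, t in zip(p_visual, p_text)]

def objective(gate, p_visual, p_text, reliability, lam):
    """Fused-posterior entropy plus a quadratic pull of the gate toward `reliability`.

    The quadratic form is an assumption made for this sketch only.
    """
    return entropy(fuse(gate, p_visual, p_text)) + lam * (gate - reliability) ** 2

def best_gate(p_visual, p_text, reliability, lam, steps=100):
    """Grid search over gate values in [0, 1] (stand-in for a learned gate)."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda g: objective(g, p_visual, p_text, reliability, lam))

def predict(gate, p_visual, p_text):
    """Index of the most probable class under the fused posterior."""
    fused = fuse(gate, p_visual, p_text)
    return max(range(len(fused)), key=fused.__getitem__)

# Toy setup: class 0 is correct. The visual branch is right but soft;
# the shifted textual branch is confident and wrong.
P_VISUAL = [0.50, 0.30, 0.20]
P_TEXT = [0.05, 0.90, 0.05]

# lam = 0.0 reduces the objective to pure entropy minimization.
g_entropy_only = best_gate(P_VISUAL, P_TEXT, reliability=0.9, lam=0.0)
g_with_prior = best_gate(P_VISUAL, P_TEXT, reliability=0.9, lam=5.0)

print("entropy only:", g_entropy_only, "-> class", predict(g_entropy_only, P_VISUAL, P_TEXT))
print("with prior:  ", g_with_prior, "-> class", predict(g_with_prior, P_VISUAL, P_TEXT))
```

In this toy setup, pure entropy minimization hands all the weight to the confident but wrong textual branch, giving a sharper fused posterior and an incorrect prediction. The penalized objective keeps the gate near the assumed reliable branch and recovers the correct class, at the cost of a higher-entropy fused posterior.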

Workflow Status

Review status: pending
Role: unreviewed
Read priority: soon
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{chen2026majorization,
  title = {Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift},
  author = {Lixian Chen and Mingxuan Huang and Yanhui Chen and Junyi Lin and Yang Shi},
  year = {2026},
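  abstract = {Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.},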
  url = {https://arxiv.org/abs/2604.24602},
  keywords = {cs.CV},
  eprint = {2604.24602},
  archiveprefix = {arXiv},
}

Metadata

{}