Paper Detail

Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

Lixian Chen, Mingxuan Huang, Yanhui Chen, Junyi Lin, Yang Shi

arXiv · Score 7.3

Published 2026-04-27 · First seen 2026-04-28

General AI

Abstract

Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
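The failure mode described in the abstract can be illustrated with a toy example. The sketch below is not MG-MTTA: it assumes the fused posterior is a simple convex mixture of an image posterior and a text posterior with a scalar gate, and all names (`fuse`, `majorizes`, `p_img`, `p_txt`) and numbers are hypothetical. It shows that shifting the gate toward a confidently wrong modality lowers the entropy of the fused posterior, and even makes it majorize the image-led posterior, while flipping the prediction to the wrong class.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def fuse(p_img, p_txt, gate):
    """Convex mixture of two posteriors (assumed fusion rule).

    `gate` is the weight on the image branch; 1 - gate goes to the text branch.
    """
    return [gate * a + (1.0 - gate) * b for a, b in zip(p_img, p_txt)]


def majorizes(p, q, tol=1e-12):
    """Return True if p majorizes q.

    With both vectors sorted in descending order, every top-k partial sum
    of p must be at least the corresponding partial sum of q.
    """
    sum_p = sum_q = 0.0
    for a, b in zip(sorted(p, reverse=True), sorted(q, reverse=True)):
        sum_p += a
        sum_q += b
        if sum_p < sum_q - tol:
            return False
    return True


def argmax(p):
    """Index of the largest entry."""
    return max(range(len(p)), key=p.__getitem__)


# Toy 3-class problem; the true class is index 0 (made-up numbers).
p_img = [0.60, 0.25, 0.15]  # reliable modality, moderately confident
p_txt = [0.05, 0.90, 0.05]  # shifted modality, confidently wrong

fused_img_led = fuse(p_img, p_txt, gate=0.8)  # image branch dominates
fused_txt_led = fuse(p_img, p_txt, gate=0.2)  # unreliable text branch dominates

for name, fused in [("image-led", fused_img_led), ("text-led", fused_txt_led)]:
    print(name, "entropy:", round(entropy(fused), 3), "prediction:", argmax(fused))

print("text-led majorizes image-led:", majorizes(fused_txt_led, fused_img_led))
```

In this toy setting an entropy-only objective would prefer the text-led gate, which is the sharper but incorrect fusion. This is the kind of modality-dominance failure the abstract says a reliability-aware gate prior is meant to prevent.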

BibTeX

@article{chen2026majorization,
  title = {Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift},
  author = {Lixian Chen and Mingxuan Huang and Yanhui Chen and Junyi Lin and Yang Shi},
  year = {2026},
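  abstract = {Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.},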
  url = {https://arxiv.org/abs/2604.24602},
  keywords = {cs.CV},
  eprint = {2604.24602},
  archiveprefix = {arXiv},
}
