Paper Detail

Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective

Salimeh Sekeh, Mary Wisell

arxiv Score 15.0

Published 2026-06-12 · First seen 2026-06-16

Research Track A · General AI

Abstract

Continual vision-language models are commonly addressed through sequential fine-tuning; however, although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of newly learned environments (tasks) at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have adequately studied continual learning and catastrophic forgetting in vision-language models (VLMs), the theoretical understanding of modality-specific contributions across a sequence of environments remains largely unexplored. In this paper, we present a new theoretical perspective to understand the cross-modal (vision-language) contributions to consecutive environments. We empirically evaluate our theoretical findings on large VLMs and demonstrate their effectiveness in capturing environment-level cross-modal contributions. Our analysis provides deeper insights into continual VLMs, highlighting their contribution robustness to varying task orders and inter-task similarities, and their improved generalization performance.

BibTeX

@article{sekeh2026understanding,
  title = {Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective},
  author = {Salimeh Sekeh and Mary Wisell},
  year = {2026},
  abstract = {Continual vision-language models are commonly addressed through sequential fine-tuning; however, although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of newly learned environments (tasks) at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have adequately studied continual learning and catastrophic forgetting in vision-language models (VLMs), the theoretical understanding of},
  url = {https://arxiv.org/abs/2606.14883},
  keywords = {cs.CV, cs.LG},
  eprint = {2606.14883},
  archiveprefix = {arXiv},
}
