Paper Detail

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Bingyi Cao, Koert Chen, Kevis-Kokitsi Maninis, Kaifeng Chen, Arjun Karpur, Ye Xia, Sahil Dua, Tanmaya Dabral, Guangxing Han, Bohyung Han, Joshua Ainslie, Alex Bewley, Mithun Jacob, René Wagner, Washington Ramos, Krzysztof Choromanski, Mojtaba Seyedhosseini, Howard Zhou, André Araujo

huggingface Score 6.5

Published 2026-04-13 · First seen 2026-04-20

General AI

Abstract

Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment; surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/.
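
Illustrative sketch (not the authors' code): the abstract says iBOT++ lets unmasked tokens contribute to the loss, whereas the iBOT masked image objective scores only masked patches. The toy Python below contrasts the two choices on a per-patch teacher-to-student cross-entropy. Everything in it is an assumption made for illustration: the function names (`softmax`, `cross_entropy`, `patch_loss`), the `include_unmasked` flag, the uniform averaging over patches, and the toy logits. The paper's actual loss weighting, temperatures, and centering are not given in the abstract and are not modeled here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(teacher_probs, student_logits):
    """Cross-entropy H(teacher, student) for a single patch token."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def patch_loss(teacher_logits, student_logits, mask, include_unmasked):
    """Average per-patch distillation loss (hypothetical helper).

    mask[i] is True where patch i was masked in the student's input.
    include_unmasked=False: only masked patches are scored (iBOT-style).
    include_unmasked=True: every patch is scored, which is the idea the
    abstract attributes to iBOT++ (uniform weighting is an assumption).
    """
    losses = []
    for t_logits, s_logits, is_masked in zip(teacher_logits, student_logits, mask):
        if is_masked or include_unmasked:
            losses.append(cross_entropy(softmax(t_logits), s_logits))
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: three patches, two prototype dimensions each.
teacher = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
student = [[2.0, 0.0], [2.0, 0.0], [1.0, 1.0]]  # patch 1 disagrees with the teacher
mask = [True, False, False]                      # only patch 0 is masked

masked_only = patch_loss(teacher, student, mask, include_unmasked=False)
all_patches = patch_loss(teacher, student, mask, include_unmasked=True)
```

In this toy setup the student's error sits on an unmasked patch, so the masked-only loss never sees it, while the all-patch variant does and comes out larger. That is only meant to show why supervising unmasked tokens can change the training signal for dense patch features; it makes no claim about the magnitude of the effect reported in the paper.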

BibTeX

@misc{cao2026tipsv2,
  title = {TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment},
  author = {Bingyi Cao and Koert Chen and Kevis-Kokitsi Maninis and Kaifeng Chen and Arjun Karpur and Ye Xia and Sahil Dua and Tanmaya Dabral and Guangxing Han and Bohyung Han and Joshua Ainslie and Alex Bewley and Mithun Jacob and René Wagner and Washington Ramos and Krzysztof Choromanski and Mojtaba Seyedhosseini and Howard Zhou and André Araujo},
  year = {2026},
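  abstract = {Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment; surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/.},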
  url = {https://huggingface.co/papers/2604.12012},
  keywords = {vision-language pretraining, dense patch representations, text embeddings, patch-level distillation, iBOT++, masked image objective, exponential moving average, caption sampling, image-text encoder models, downstream applications, huggingface daily},
  eprint = {2604.12012},
  archiveprefix = {arXiv},
}
