Paper Detail

Let ViT Speak: Generative Language-Image Pre-training

Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei

arxiv Score 11.3

Published 2026-05-01 · First seen 2026-05-04

General AI

Abstract

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
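
The objective described in the abstract (predicting caption tokens from visual tokens with a standard language modeling loss, inside a single transformer) can be illustrated with a small sketch. This is not the authors' implementation: the function name `caption_lm_loss`, the sequence layout (visual tokens first, caption tokens after), and the choice to average the loss over text positions only are assumptions made for illustration, and the attention masking used by GenLIP is not specified on this page.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def caption_lm_loss(logits, text_ids, num_visual):
    """Mean next-token cross-entropy over caption tokens only (assumed layout).

    logits:     one list of vocabulary scores per position of the joint
                sequence [visual tokens, text tokens] (hypothetical model output).
    text_ids:   caption token ids from a hypothetical vocabulary.
    num_visual: number of visual tokens at the start of the sequence.
    """
    assert num_visual >= 1 and len(text_ids) >= 1
    assert len(logits) == num_visual + len(text_ids)
    total = 0.0
    for i, target in enumerate(text_ids):
        # The position just before text token i predicts it, so the last
        # visual position predicts the first caption token. Visual tokens
        # are never used as targets, and the final position's logits are
        # unused because no token follows it in this sketch.
        pos = num_visual + i - 1
        total += -log_softmax(logits[pos])[target]
    return total / len(text_ids)


# Toy usage: 3 visual positions, 2 caption tokens, vocabulary of size 4.
uniform = [[0.0] * 4 for _ in range(5)]
loss = caption_lm_loss(uniform, text_ids=[1, 2], num_visual=3)
```

With uniform scores the loss equals `math.log(4)`, the entropy of a uniform guess over four tokens. Because every term depends only on the transformer's own per-position outputs, no contrastive batch construction or separate text decoder appears anywhere in the computation, which is the simplification the abstract emphasizes.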

BibTeX

@article{fang2026let,
  title = {Let ViT Speak: Generative Language-Image Pre-training},
  author = {Yan Fang and Mengcheng Lan and Zilong Huang and Weixian Lei and Yunqing Zhao and Yujie Zhong and Yingchen Yu and Qi She and Yao Zhao and Yunchao Wei},
  year = {2026},
  abstract = {In this paper, we present \textbf{Gen}erative \textbf{L}anguage-\textbf{I}mage \textbf{P}re-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text dec},
  url = {https://arxiv.org/abs/2605.00809},
  keywords = {cs.CV, Vision Transformers, multimodal large language models, autoregressive nature, language modeling objective, visual tokens, language tokens, transformer, multimodal benchmarks, Recap-DataComp-1B, multi-resolution images, native aspect ratios, OCR, chart understanding, code available, huggingface daily},
  eprint = {2605.00809},
  archiveprefix = {arXiv},
}
