Paper Detail

Multilingual Training and Evaluation Resources for Vision-Language Models

Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini, Andrea Zugarini

Browse

Workflow Queues

arxiv Score 10.3

Published 2026-04-20 · First seen 2026-04-21

General AI

Open paper source

Abstract

Vision Language Models (VLMs) achieved rapid progress in the recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLMs training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained regenerating examples from Pixmo pre-existing datasets with permissively licensed models: PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k. On the evaluation side, we construct a set of multilingual benchmarks derived translating widely used English datasets (MMbench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, with respect to English only, in VLMs training. Experiments, comprising 3 different models show that using multilingual, multimodal examples for training VLMs aids is consistently beneficial on non-English benchmarks, with positive transfer to English as well.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{baiamonte2026multilingual,
  title = {Multilingual Training and Evaluation Resources for Vision-Language Models},
  author = {Daniela Baiamonte and Elena Fano and Matteo Gabburo and Stefano Simonazzi and Leonardo Rigutini and Andrea Zugarini},
  year = {2026},
  abstract = {Vision Language Models (VLMs) achieved rapid progress in the recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLMs training and evaluation spanning five European languages (English, },
  url = {https://arxiv.org/abs/2604.18347},
  keywords = {cs.CL, cs.AI},
  eprint = {2604.18347},
  archiveprefix = {arXiv},
}

Metadata

{}