Paper Detail

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou

huggingface Score 14.0

Published 2026-05-18 · First seen 2026-05-25

General AI

Abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.
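
The abstract says SWIM extracts multi-layer cross-attention maps for object nouns and enforces spatial consistency with ground-truth masks, but it does not give the loss. The following is only a minimal sketch of what such an objective could look like, assuming the per-layer maps are averaged and the penalty is the share of attention mass that falls outside the mask. The function names (`average_layers`, `normalize`, `mask_alignment_loss`) and the choice of loss are hypothetical illustrations, not the authors' implementation.

```python
from typing import List

Grid = List[List[float]]


def average_layers(maps: List[Grid]) -> Grid:
    """Element-wise mean of same-shaped attention maps, one per layer."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
        for r in range(rows)
    ]


def normalize(grid: Grid) -> Grid:
    """Scale a non-negative map so its entries sum to 1 (uniform if all zero)."""
    total = sum(sum(row) for row in grid)
    if total == 0:
        n = len(grid) * len(grid[0])
        return [[1.0 / n for _ in row] for row in grid]
    return [[v / total for v in row] for row in grid]


def mask_alignment_loss(maps: List[Grid], mask: List[List[int]]) -> float:
    """Hypothetical loss: fraction of attention mass outside the binary mask.

    Assumption: this stands in for the paper's spatial-consistency term,
    whose exact form is not stated in the abstract.
    """
    attn = normalize(average_layers(maps))
    inside = sum(
        a * m
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
    )
    return 1.0 - inside


# Toy example: two layers of 2x2 attention for one object-noun token.
layer_1 = [[1.0, 0.0], [0.0, 0.0]]
layer_2 = [[0.5, 0.5], [0.0, 0.0]]
object_mask = [[1, 0], [0, 0]]
loss = mask_alignment_loss([layer_1, layer_2], object_mask)
```

In this toy case a quarter of the averaged attention lies outside the mask, so the loss is 0.25. It reaches 0 only when all attention falls on masked cells. A real training setup would compute a differentiable version of this on model tensors.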

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

BibTeX

@misc{sun2026see,
  title = {See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding},
  author = {Boyuan Sun and Bowen Yin and Yuanming Li and Xihan Wei and Qibin Hou},
  year = {2026},
  abstract = {We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal lar},
  url = {https://huggingface.co/papers/2605.18018},
  keywords = {vision-language representations, cross-modal attention, multimodal large language models, natural language referring expressions, spatial consistency, multi-layer cross-attention maps, code available, huggingface daily},
  eprint = {2605.18018},
  archiveprefix = {arXiv},
}
