OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Jiahao Wang, An Ping, Yanghai Wang, Yuanxing Zhang, Shihao Li, Hanyan Bian, Yichi Ren, Yize Zhang, Han Wang, Haowen Chen, Junze Li, Jiaqi Wang, Yiyang Hu, Zhuze Xu, Zijie Zhang, Jiaheng Liu

huggingface Score 14.5

Published 2026-06-07 · First seen 2026-06-09

General AI

Abstract

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.
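
As a minimal sketch of how those fields might be added, the Python helper below merges notes into one record of a JSON Lines file. Only the four field names and the file name `papers.jsonl` come from the hint above. The helper name `enrich_record`, the shape of the values, and the idea of locating a record by a key such as `eprint` are assumptions made for illustration; the real schema of `papers.jsonl` may differ.

```python
import json
from pathlib import Path

# Field names taken from the hint above; the shape of their values is an assumption.
BRIEF_FIELDS = ("summary_sections", "why_relevant", "claim_impact", "next_action")


def enrich_record(path, match_key, match_value, notes):
    """Merge reading-brief notes into the first record of a JSONL file
    where record[match_key] == match_value.

    Hypothetical helper: returns True if a record was updated, False otherwise.
    """
    unknown = set(notes) - set(BRIEF_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")

    path = Path(path)
    updated = False
    out_lines = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # drop blank lines rather than failing on them
        record = json.loads(line)
        if not updated and record.get(match_key) == match_value:
            record.update(notes)  # add or overwrite the brief fields
            updated = True
        out_lines.append(json.dumps(record, ensure_ascii=False))

    # Rewrite the file with one JSON object per line.
    path.write_text("".join(item + "\n" for item in out_lines), encoding="utf-8")
    return updated
```

A call such as `enrich_record("papers.jsonl", "eprint", "2606.08572", {"next_action": "skim the evaluation setup"})` would then update this paper's record, assuming records carry an `eprint` key. The helper rewrites the whole file in place, so keeping a backup before running it is prudent.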

BibTeX

@misc{wang2026omnicap,
  title = {OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning},
  author = {Jiahao Wang and An Ping and Yanghai Wang and Yuanxing Zhang and Shihao Li and Hanyan Bian and Yichi Ren and Yize Zhang and Han Wang and Haowen Chen and Junze Li and Jiaqi Wang and Yiyang Hu and Zhuze Xu and Zijie Zhang and Jiaheng Liu},
  year = {2026},
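  abstract = {While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.},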
  url = {https://huggingface.co/papers/2606.08572},
  keywords = {Omni-modal Large Language Models, omni-modal captioning, instruction-following, Temporal Grounding, format correctness, content correctness, constraint types, multi-faceted user instructions, benchmark evaluation, format-content tradeoff, code available, huggingface daily},
  eprint = {2606.08572},
  archiveprefix = {arXiv},
}
