AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

Rintaro Otsubo, Ryo Fujii, Reina Ishikawa, Taiki Kanaya, Kanta Sawafuji, Hiroki Kajita, Shigeki Sakai, Hideo Saito, Ryo Hachiuma

Hugging Face score: 12.8

Published 2026-07-02 · First seen 2026-07-03

Research Track A · General AI

Abstract

Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot assessments on general, daily-life benchmarks. This creates a critical disconnect from real-world applications in specialized fields, where models inevitably encounter rare visual concepts and complex spatio-temporal dynamics. Since exhaustive pre-training across infinite data distributions is infeasible, the ability to adapt to novel domains is essential. To bridge this gap, we introduce AnyGroundBench, a domain-adaptation benchmark designed to shift the STVG evaluation paradigm from static zero-shot testing to rigorous domain adaptation. Targeting five specialized domains (animal, industry, sports, surgery, and public security), AnyGroundBench pairs newly captured videos, such as expert-annotated mouse behaviors, with established datasets, unifying them through dense, high-fidelity spatio-temporal annotations. Crucially, the benchmark provides dedicated training subsets to systematically measure domain adaptability. We extensively evaluate 15 state-of-the-art VLMs, assessing their zero-shot generalization and In-Context Learning (ICL) capabilities under practical computational constraints. Ultimately, our findings reveal that current models fail in both zero-shot and ICL-based adaptation when confronted with specialized domains, exposing critical flaws in spatio-temporal reasoning that future research must address.

BibTeX

@misc{otsubo2026anygroundbench,
  title = {AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models},
  author = {Rintaro Otsubo and Ryo Fujii and Reina Ishikawa and Taiki Kanaya and Kanta Sawafuji and Hiroki Kajita and Shigeki Sakai and Hideo Saito and Ryo Hachiuma},
  year = {2026},
  abstract = {Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot assessments on general, daily-life benchmarks. This creates a critical disconnect from real-world applications in specialized fields, where models inevitably encounter rare visual concepts and complex spatio-temporal dynamics. Since exhaustive pre-training across infinite data distributions is infeasible, the ability to adapt to novel domains is essential.},
  url = {https://huggingface.co/papers/2607.02269},
  keywords = {Vision-Language Models, Spatio-Temporal Video Grounding, domain adaptation, In-Context Learning, zero-shot generalization, code available, huggingface daily},
  eprint = {2607.02269},
  archiveprefix = {arXiv},
}
