Paper Detail

DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain

Song Jin, Juntian Zhang, Xun Zhang, Zeying Tian, Fei Jiang, Guojun Yin, Wei Lin, Yong Liu, Rui Yan

Browse

Workflow Queues

huggingface Score 9.5

Published 2026-04-12 · First seen 2026-04-14

General AI

Open paper source

Abstract

Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained "hard" negatives from identical menus and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All codes are released in https://github.com/meituan/DiningBench.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@misc{jin2026diningbench,
  title = {DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain},
  author = {Song Jin and Juntian Zhang and Xun Zhang and Zeying Tian and Fei Jiang and Guojun Yin and Wei Lin and Yong Liu and Rui Yan},
  year = {2026},
  abstract = {Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answer},
  url = {https://huggingface.co/papers/2604.10425},
  keywords = {Vision-Language Models, fine-grained classification, nutrition estimation, visual question answering, multi-view imaging, Chain-of-Thought reasoning, code available, huggingface daily},
  eprint = {2604.10425},
  archiveprefix = {arXiv},
}

Metadata

{}