Paper Detail

Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou

huggingface Score 9.5

Published 2026-04-08 · First seen 2026-04-11

General AI

Abstract

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
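
The abstract names three scores (exact-match, partial-match, and attribute-level accuracy) without defining them here. The sketch below shows one plausible reading, not the authors' implementation. Its assumptions: a record counts as an exact match when every attribute is judged correct, as a partial match when some but not all are, and attribute-level accuracy is the per-attribute fraction judged correct. The names `ATTRIBUTES`, `string_judge`, and `score` are hypothetical, and `string_judge` is a normalized string comparison standing in for the LLM-as-Judge step.

```python
from typing import Callable, Dict, List

# Attribute names taken from the abstract's examples; the real schema may differ.
ATTRIBUTES = ("creator", "origin", "period")


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def string_judge(pred: str, ref: str) -> bool:
    """Stand-in for the LLM judge (assumption): normalized string equality."""
    return normalize(pred) == normalize(ref)


def score(
    preds: List[Dict[str, str]],
    refs: List[Dict[str, str]],
    judge: Callable[[str, str], bool] = string_judge,
) -> Dict[str, object]:
    """Compute exact-match, partial-match, and per-attribute accuracy."""
    if not refs or len(preds) != len(refs):
        raise ValueError("preds and refs must be non-empty and the same length")
    n = len(refs)
    per_attr = {a: 0 for a in ATTRIBUTES}
    exact = 0
    partial = 0
    for pred, ref in zip(preds, refs):
        # A missing predicted attribute is scored as an empty string.
        hits = [judge(pred.get(a, ""), ref[a]) for a in ATTRIBUTES]
        for a, hit in zip(ATTRIBUTES, hits):
            per_attr[a] += int(hit)
        if all(hits):
            exact += 1
        elif any(hits):
            partial += 1
    return {
        "exact_match": exact / n,
        "partial_match": partial / n,
        "attribute_accuracy": {a: c / n for a, c in per_attr.items()},
    }


# Toy records with placeholder values, not data from the benchmark.
toy_refs = [
    {"creator": "Artist A", "origin": "Region X", "period": "Period 1"},
    {"creator": "Unknown", "origin": "Region Y", "period": "Period 2"},
]
toy_preds = [
    {"creator": "artist a", "origin": "Region X", "period": "Period 1"},
    {"creator": "unknown", "origin": "Region Z", "period": "Period 2"},
]
toy_result = score(toy_preds, toy_refs)
```

Grouping the records by cultural region before calling `score` would give the per-region breakdown the abstract describes. Passing a different `judge` callable is where a semantic comparison could replace the string check.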

BibTeX

@misc{jiang2026appear2meaning,
  title = {Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images},
  author = {Yuechen Jiang and Enze Zhang and Md Mohsinul Kabir and Qianqian Xie and Stavroula Golfomitsou and Konstantinos Arvanitis and Sophia Ananiadou},
  year = {2026},
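  abstract = {Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.},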
  url = {https://huggingface.co/papers/2604.07338},
  keywords = {vision-language models, cultural heritage, structured cultural metadata, LLM-as-Judge framework, semantic alignment, exact-match accuracy, partial-match accuracy, attribute-level accuracy, huggingface daily},
  eprint = {2604.07338},
  archiveprefix = {arXiv},
}
