Paper Detail

Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?

Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao

arxiv Score 8.8

Published 2026-06-29 · First seen 2026-06-30

General AI

Abstract

Traditional automatic evaluation methods have been shown to be unsuitable for modern Chinese poetry because of the distinct nature of this literary genre. Human evaluation remains reliable, but is expensive and not applicable to large-scale data. In this paper, we propose Poller (Poetry LLM Evaluator), a novel method leveraging large language models (LLMs) to evaluate the poetry understanding task. Specifically, our method requires LLMs to play the role of a poem's author with detailed information, thereby emulating human evaluation and judgment by adopting the poet's perspective. We conducted comprehensive experiments on multiple LLMs, evaluating the interpretations of poems across eight specialized dimensions. Experimental results demonstrate that our method effectively reduces the evaluation error between LLMs and humans. Especially for specific dimension evaluation, Poller-based LLMs achieve a 94.55% and 89.53% error reduction for rhetorical techniques and defamiliarization, respectively, compared to baseline methods. These performances are unattainable by conventional LLM evaluation methods. Experimental results from multiple LLMs across various dimensions validate the efficacy of our method. This work bridges the gap between automated efficiency and human expertise, establishing a foundation for automated evaluation in poetry-related tasks.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
soon
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{wang2026poller,
  title = {Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?},
  author = {Shanshan Wang and Derek F. Wong and Jingming Yao and Lidia S. Chao},
  year = {2026},
  abstract = {Traditional automatic evaluation methods have been shown to be unsuitable for modern Chinese poetry because of the distinct nature of this literary genre. Human evaluation remains reliable, but is expensive and not applicable to large-scale data. In this paper, we propose Poller (Poetry LLM Evaluator), a novel method leveraging large language models (LLMs) to evaluate the poetry understanding task. Specifically, our method requires LLMs to play the role of a poem's author with detailed informati},
  url = {https://arxiv.org/abs/2606.30556},
  keywords = {cs.CL},
  eprint = {2606.30556},
  archiveprefix = {arXiv},
}

Metadata

{}