Paper Detail

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu, Jiacheng Shao, Pengfei Chen, Jiannan Ge, Kaiwen Duan, Qi Tian

arXiv · Score 16.6

Published 2026-07-02 · First seen 2026-07-03

General AI

Abstract

Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on speaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce DramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose DramaSR-LRM, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. All the data and code will be made publicly available at the project page: https://www.github.com/198808xc/DramaSR-LRM.

BibTeX

@article{li2026reasoning,
  title = {Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas},
  author = {Yuxuan Li and Lingxi Xie and Xinyue Huo and Jihao Qiu and Jiacheng Shao and Pengfei Chen and Jiannan Ge and Kaiwen Duan and Qi Tian},
  year = {2026},
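  abstract = {Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on \textbf{speaker recognition}, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce \textbf{DramaSR-532K}, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose \textbf{DramaSR-LRM}, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. \textit{All the data and code will be made publicly available at the project page: https://www.github.com/198808xc/DramaSR-LRM.}},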
  url = {https://arxiv.org/abs/2607.02504},
  keywords = {cs.CL, cs.AI, cs.CV},
  eprint = {2607.02504},
  archiveprefix = {arXiv},
}