Paper Detail

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu, Jiacheng Shao, Pengfei Chen, Jiannan Ge, Kaiwen Duan, Qi Tian

arxiv Score 16.6

Published 2026-07-02 · First seen 2026-07-03

General AI

Abstract

Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on speaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce DramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose DramaSR-LRM, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. All the data and code will be made publicly available at the project page: https://www.github.com/198808xc/DramaSR-LRM.

BibTeX

@article{li2026reasoning,
  title = {Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas},
  author = {Yuxuan Li and Lingxi Xie and Xinyue Huo and Jihao Qiu and Jiacheng Shao and Pengfei Chen and Jiannan Ge and Kaiwen Duan and Qi Tian},
  year = {2026},
  abstract = {Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on \textbf{speaker recognition}, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce \textbf{DramaSR-532K}, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integra},
  url = {https://arxiv.org/abs/2607.02504},
  keywords = {cs.CL, cs.AI, cs.CV},
  eprint = {2607.02504},
  archiveprefix = {arXiv},
}
