Paper Detail
Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu, Jiacheng Shao, Pengfei Chen, Jiannan Ge, Kaiwen Duan, Qi Tian
Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on \textbf{speaker recognition}, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce \textbf{DramaSR-532K}, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose \textbf{DramaSR-LRM}, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. \textit{All the data and code will be made publicly available at the project page: https://www.github.com/198808xc/DramaSR-LRM.}
@article{li2026reasoning,
title = {Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas},
author = {Yuxuan Li and Lingxi Xie and Xinyue Huo and Jihao Qiu and Jiacheng Shao and Pengfei Chen and Jiannan Ge and Kaiwen Duan and Qi Tian},
year = {2026},
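abstract = {Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on \textbf{speaker recognition}, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce \textbf{DramaSR-532K}, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose \textbf{DramaSR-LRM}, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. \textit{All the data and code will be made publicly available at the project page: https://www.github.com/198808xc/DramaSR-LRM.}},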
url = {https://arxiv.org/abs/2607.02504},
keywords = {cs.CL, cs.AI, cs.CV},
eprint = {2607.02504},
archiveprefix = {arXiv},
}