Paper Detail

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Junfu Pu, Yuxin Chen, Teng Wang, Ying Shan

huggingface Score 21.5

Published 2026-04-13 · First seen 2026-04-21

General AI

Abstract

Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.
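
As a rough illustration of what a hierarchical, temporally grounded script could look like as data, here is a minimal sketch. It is not taken from the paper: the class names `Scene` and `Event`, their field names, and the use of temporal intersection-over-union (IoU) as a localization measure are all assumptions made for illustration, and the paper's actual schema and evaluation framework may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One timed script entry (hypothetical schema, not the paper's)."""
    start: float            # seconds from the beginning of the video
    end: float
    character: str
    action: str = ""
    dialogue: str = ""
    expression: str = ""
    audio_cue: str = ""


@dataclass
class Scene:
    """A scene groups events under a shared time span (hypothetical)."""
    start: float
    end: float
    events: List[Event] = field(default_factory=list)


def temporal_iou(a_start: float, a_end: float,
                 b_start: float, b_end: float) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


# Example: a predicted scene compared against a reference scene.
predicted = Scene(start=0.0, end=10.0, events=[
    Event(start=1.0, end=4.0, character="A", dialogue="Hello."),
])
reference = Scene(start=5.0, end=15.0)
overlap = temporal_iou(predicted.start, predicted.end,
                       reference.start, reference.end)
```

A score like `overlap` could be computed per scene and combined with per-field text comparisons, which is one plausible reading of evaluating "temporal localization and multi-field semantic accuracy" separately.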

BibTeX

@misc{pu2026omniscript,
  title = {OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video},
  author = {Junfu Pu and Yuxin Chen and Teng Wang and Ying Shan},
  year = {2026},
  url = {https://huggingface.co/papers/2604.11102},
  keywords = {multimodal large language models, video-to-script, hierarchical evaluation framework, omni-modal language model, progressive pipeline, chain-of-thought supervised fine-tuning, reinforcement learning, temporal localization, multi-field semantic accuracy, huggingface daily},
  eprint = {2604.11102},
  archiveprefix = {arXiv},
}
