Paper Detail

MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

Huanran Hu, Zihui Ren, Dingyi Yang, Liangyu Chen, Qixiang Gao, Tiezheng Ge, Qin Jin

Browse

Workflow Queues

arxiv Score 15.3

Published 2026-04-16 · First seen 2026-04-17

General AI

Open paper source

Abstract

Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executable video scripts. We further introduce MCSC-Bench, the first large-scale MCSC dataset, comprising 11K+ well-annotated videos. Each sample includes: (1) redundant multimodal materials and user instructions; (2) a coherent, production-ready script containing material-based shots, newly planned shots (with shooting instructions), and shot-aligned voiceovers. MCSC-Bench supports comprehensive evaluation across material selection, narrative planning, and conditioned script generation, and includes both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts, highlighting the challenges posed by our benchmark. Models trained on MCSC-Bench achieve SOTA performance, with an 8B model surpassing Gemini-2.5-Pro, and generalize to out-of-domain scenarios. Downstream video generation guided by the generated scripts further validates the practical value of MCSC. Datasets are available at: https://github.com/huanran-hu/MCSC.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{hu2026mcsc,
  title = {MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production},
  author = {Huanran Hu and Zihui Ren and Dingyi Yang and Liangyu Chen and Qixiang Gao and Tiezheng Ge and Qin Jin},
  year = {2026},
  abstract = {Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executa},
  url = {https://arxiv.org/abs/2604.15127},
  keywords = {cs.MM},
  eprint = {2604.15127},
  archiveprefix = {arXiv},
}

Metadata

{}