UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

Jiehui Huang, Yuechen Zhang, Bin Xia, Jiahao Wang, Xu He, Zhenchao Tang, Meng Chu, Xin Tao, Pengfei Wan, Jiaya Jia

huggingface Score 10.0

Published 2026-06-19 · First seen 2026-06-25

General AI

Abstract

Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3, trained on annotated cinematic and music-video shots. The video stream maintains two fixed-size slots, a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the immediately preceding tail, both updated at every cut by a boundary-conditioned gate that fuses visual cut probability and beat-tracker signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through AdaLN, becomes an inference-time control knob over transition strength. We release a benchmark of 200 multi-cultural multi-shot sequences spanning six ethnic regions and ten or more languages, with per-shot reference identities, reference audio, and per-boundary transition labels. Evaluated across I2V, T2V, and R2V conditioning modes, UnityShots leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.
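The two-slot memory described in the abstract can be illustrated with a toy sketch. The abstract does not give the gate's functional form, so everything concrete here is an assumption made for illustration: the sigmoid fusion of the two boundary signals, the weights `w_cut`, `w_beat`, and `bias`, the convex-blend update, the damped `ltm_rate`, and the names `boundary_gate` and `ShotMemory` are all hypothetical and are not taken from the paper. Features are plain lists of floats rather than real video latents.

```python
import math
from dataclasses import dataclass
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def boundary_gate(cut_prob: float, beat_strength: float,
                  w_cut: float = 4.0, w_beat: float = 2.0,
                  bias: float = -3.0) -> float:
    """Fuse visual cut probability and a beat-tracker signal into one gate.

    Assumption: a weighted sum passed through a sigmoid. The weights are
    arbitrary illustrative constants, not values from the paper.
    """
    return sigmoid(w_cut * cut_prob + w_beat * beat_strength + bias)


def blend(old: List[float], new: List[float], g: float) -> List[float]:
    """Convex combination: g = 0 keeps the old slot, g = 1 overwrites it."""
    return [(1.0 - g) * o + g * n for o, n in zip(old, new)]


@dataclass
class ShotMemory:
    ltm: List[float]        # long-term slot, anchored to the opening shot
    stm: List[float]        # short-term slot, tail of the preceding shot
    ltm_rate: float = 0.1   # assumption: LTM moves more slowly than STM

    @classmethod
    def from_opening_shot(cls, feat: List[float]) -> "ShotMemory":
        # Both slots start from the opening shot's features.
        return cls(ltm=list(feat), stm=list(feat))

    def update_at_cut(self, tail_feat: List[float],
                      cut_prob: float, beat_strength: float) -> float:
        """Update both fixed-size slots at a shot boundary; return the gate."""
        g = boundary_gate(cut_prob, beat_strength)
        self.stm = blend(self.stm, tail_feat, g)
        # Damped update keeps the LTM close to its opening-shot anchor.
        self.ltm = blend(self.ltm, tail_feat, self.ltm_rate * g)
        return g


memory = ShotMemory.from_opening_shot([1.0, 0.0])
gate = memory.update_at_cut([0.0, 1.0], cut_prob=0.9, beat_strength=1.0)
print(f"gate={gate:.3f} ltm={memory.ltm} stm={memory.stm}")
```

Both slots keep the same length no matter how many cuts are processed, which is the property the abstract contrasts with memory banks that grow linearly with the number of shots.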

BibTeX

@misc{huang2026unityshots,
  title = {UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating},
  author = {Jiehui Huang and Yuechen Zhang and Bin Xia and Jiahao Wang and Xu He and Zhenchao Tang and Meng Chu and Xin Tao and Pengfei Wan and Jiaya Jia},
  year = {2026},
  abstract = {Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3, trained on annotated cinematic and music-video shots.},
  url = {https://huggingface.co/papers/2606.21661},
  keywords = {multi-shot audio-video generation, LTX-2.3, long-term memory, short-term memory, boundary-conditioned gate, visual cut probability, beat-tracker signals, reference speaker token, discrete cut-type prior, AdaLN, cross-shot coherence, code available, huggingface daily},
  eprint = {2606.21661},
  archiveprefix = {arXiv},
}