Paper Detail

Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

Gordon Chen, Ziqi Huang, Ziwei Liu

huggingface Score 4.5

Published 2026-04-11 · First seen 2026-04-14

General AI

Abstract

Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between events. When using a single paragraph-style prompt to describe a sequence of complex events, models often exhibit semantic entanglement, where concepts intended for different moments in the video bleed into one another, resulting in poor text-video alignment. To address these limitations, we propose Prompt Relay, an inference-time, plug-and-play method to enable fine-grained temporal control in multi-event video generation, requiring no architectural modifications and no additional computational overhead. Prompt Relay introduces a penalty into the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt, allowing the model to represent one semantic concept at a time and thereby improving temporal prompt alignment, reducing semantic interference, and enhancing visual quality.
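The abstract does not give the exact form of the penalty, so the following is only a minimal sketch of the general idea, under stated assumptions: one query per frame (real video diffusion models use many spatiotemporal tokens per frame), a large negative additive bias standing in for the penalty, and hypothetical helper names (`segment_penalty`, `penalized_attention`) that do not come from the paper or its code release. It illustrates how a bias added to cross-attention logits can restrict each temporal segment to the tokens of its assigned prompt.

```python
import math

def segment_penalty(num_frames, segments, token_spans, penalty=-1e9):
    """Build an additive (frames x tokens) bias for cross-attention logits.

    segments:    half-open (start_frame, end_frame) ranges, one per prompt.
    token_spans: half-open (start_token, end_token) ranges, one per prompt.
    Entries are 0.0 where a frame may attend to a token, `penalty` elsewhere.
    The hard, mask-like penalty value is an assumption made for illustration.
    """
    num_tokens = max(end for _, end in token_spans)
    bias = [[penalty] * num_tokens for _ in range(num_frames)]
    for (f0, f1), (t0, t1) in zip(segments, token_spans):
        for f in range(f0, f1):
            for t in range(t0, t1):
                bias[f][t] = 0.0
    return bias

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def penalized_attention(logits, bias):
    """Add the bias to raw query-key logits, then normalize each frame's row."""
    return [
        softmax([l + b for l, b in zip(logit_row, bias_row)])
        for logit_row, bias_row in zip(logits, bias)
    ]

# Hypothetical toy setup: 4 frames, two prompts of 2 text tokens each.
segments = [(0, 2), (2, 4)]      # frames 0-1 -> prompt A, frames 2-3 -> prompt B
token_spans = [(0, 2), (2, 4)]   # tokens 0-1 -> prompt A, tokens 2-3 -> prompt B
bias = segment_penalty(4, segments, token_spans)
logits = [[0.0] * 4 for _ in range(4)]  # uniform logits, for illustration only
weights = penalized_attention(logits, bias)
```

In this toy case, the frames in the first segment place all of their attention weight on the first prompt's tokens, and the frames in the second segment on the second prompt's tokens. Because the bias is only added to logits that cross-attention already computes, a scheme of this kind needs no new layers or weights, which is consistent with the abstract's description of the method as plug-and-play.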

BibTeX

@misc{chen2026prompt,
  title = {Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation},
  author = {Gordon Chen and Ziqi Huang and Ziwei Liu},
  year = {2026},
  abstract = {Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between e},
  url = {https://huggingface.co/papers/2604.10030},
  keywords = {video diffusion models, temporal succession, cross-attention mechanism, semantic concepts, temporal control, multi-event video generation, text-video alignment, semantic entanglement, prompt relay, temporal segments, visual quality, code available, huggingface daily},
  eprint = {2604.10030},
  archiveprefix = {arXiv},
}