Paper Detail

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu

huggingface Score 9.5

Published 2026-06-15 · First seen 2026-06-16

General AI

Abstract

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.
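The two-bank design described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the class and method names (`DualContextMemory`, `write`, `apply_edit`, `retrieve`), the integer view ids, and the plain-list features are all hypothetical. In particular, the invalidation rule used here (every edit drops the RGB entry, and only layout edits drop the depth entry) is an assumption made for illustration, since the abstract does not state the actual update rule.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Entry:
    step: int             # generation step at which the entry was written
    feature: List[float]  # stand-in for an encoded frame feature


@dataclass
class DualContextMemory:
    """Toy two-bank memory keyed by a hypothetical integer view id."""
    rgb: Dict[int, Entry] = field(default_factory=dict)    # appearance bank
    depth: Dict[int, Entry] = field(default_factory=dict)  # geometry bank

    def write(self, view_id: int, step: int,
              rgb_feat: List[float], depth_feat: List[float]) -> None:
        # A new observation overwrites any older entry for the same view.
        self.rgb[view_id] = Entry(step, rgb_feat)
        self.depth[view_id] = Entry(step, depth_feat)

    def apply_edit(self, view_ids: Iterable[int], changes_layout: bool) -> None:
        # ASSUMED rule, not taken from the paper: any edit makes the stored
        # appearance stale, but geometry is dropped only for layout edits.
        for view_id in view_ids:
            self.rgb.pop(view_id, None)
            if changes_layout:
                self.depth.pop(view_id, None)

    def retrieve(self, view_id: int) -> Tuple[Optional[Entry], Optional[Entry]]:
        # Either slot may be None, giving a mixed-modality reference.
        return self.rgb.get(view_id), self.depth.get(view_id)


memory = DualContextMemory()
memory.write(view_id=0, step=0, rgb_feat=[0.1, 0.2], depth_feat=[1.0, 2.0])
memory.write(view_id=1, step=1, rgb_feat=[0.3, 0.4], depth_feat=[3.0, 4.0])

# Appearance-only edit on view 0: RGB is dropped, depth is kept.
memory.apply_edit(view_ids=[0], changes_layout=False)
rgb0, depth0 = memory.retrieve(0)
rgb1, depth1 = memory.retrieve(1)
```

Under these assumptions, view 0 returns a depth-only reference after the appearance edit, while the untouched view 1 still returns both modalities.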

BibTeX

@misc{yang2026permavid,
  title = {PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory},
  author = {Shuai Yang and Bingjie Gao and Ziwei Liu and Jiaqi Wang and Dahua Lin and Tong Wu},
  year = {2026},
  url = {https://huggingface.co/papers/2606.16449},
  keywords = {multi-modal context memory, spatial context, semantic appearance, geometric structure, edit-aware memory update, memory retrieval strategy, RGB context memory, depth context memory, multi-modal feature fusion, memory-guided video generation, code available, huggingface daily},
  eprint = {2606.16449},
  archiveprefix = {arXiv},
}
