Paper Detail

MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

JoungBin Lee, Jaewoo Jung, Jongmin Lee, Tongmin Kim, Hyunsung Kim, Takuya Narihira, Kazumi Fukuda, Jahyeok Koo, Jisang Han, Yuki Mitsufuji, Seungryong Kim

arxiv Score 8.2

Published 2026-06-24 · First seen 2026-06-25

General AI

Abstract

Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that leverages multi-view point tracking as an additional geometric and motion supervision signal for camera-conditioning-only novel-view video diffusion models. Our key finding is that specific attention layers encode strong correspondence cues, where query features attend to key features at geometrically corresponding locations across views and over time, and the misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen improves existing models to better follow the motion in the reference view and maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.
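
The joint-training idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes correspondences are read out of a single attention map with a soft-argmax, which here stands in for the paper's auxiliary multi-view tracking head. It also assumes the tracking error is added to the diffusion loss with a fixed weight. All function names, the L1 error, and the weight value of 0.1 are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_correspondence(queries, keys, height, width):
    """Soft-argmax over attention weights (hypothetical stand-in for a tracking head).

    queries: (N, d) features of tracked points in one view/frame.
    keys:    (height * width, d) features of another view/frame, row-major.
    Returns the attention-weighted expected (x, y) key location per query, shape (N, 2).
    """
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)  # (N, H*W)
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(np.float64)  # (H*W, 2)
    return attn @ grid

def tracking_loss(pred_xy, gt_xy, visible):
    """Mean L1 position error over visible points only (assumed loss form)."""
    err = np.abs(pred_xy - gt_xy).sum(axis=-1)
    return float((err * visible).sum() / max(visible.sum(), 1))

def joint_loss(diffusion_loss, track_loss, weight=0.1):
    """Diffusion objective plus weighted point-tracking objective (weight is assumed)."""
    return diffusion_loss + weight * track_loss

# Toy usage: a 4x4 key grid with one-hot features, so attention is nearly one-hot.
keys = np.eye(16) * 50.0
queries = keys[[5, 10]]                      # should attend to cells 5 and 10
pred = attention_correspondence(queries, keys, height=4, width=4)
gt = np.array([[1.0, 1.0], [2.0, 3.0]])      # second ground-truth track is off by one row
loss = joint_loss(1.0, tracking_loss(pred, gt, np.array([1.0, 1.0])))
```

In this toy setup a misaligned attention peak raises the tracking term directly, which is the kind of correspondence error the abstract links to motion inconsistency.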

BibTeX

@article{lee2026mvtrack4gen,
  title = {{MVTrack4Gen}: Multi-View Point Tracking as Geometric Supervision for {4D} Video Generation},
  author = {JoungBin Lee and Jaewoo Jung and Jongmin Lee and Tongmin Kim and Hyunsung Kim and Takuya Narihira and Kazumi Fukuda and Jahyeok Koo and Jisang Han and Yuki Mitsufuji and Seungryong Kim},
  year = {2026},
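  abstract = {Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that leverages multi-view point tracking as an additional geometric and motion supervision signal for camera-conditioning-only novel-view video diffusion models. Our key finding is that specific attention layers encode strong correspondence cues, where query features attend to key features at geometrically corresponding locations across views and over time, and the misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen improves existing models to better follow the motion in the reference view and maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.},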
  url = {https://arxiv.org/abs/2606.26087},
  keywords = {cs.CV, novel-view video synthesis, camera-conditioning-only, diffusion models, multi-view point tracking, geometric consistency, motion fidelity, attention layers, correspondence cues, auxiliary multi-view tracking head, joint training, cross-view geometric consistency, code available, huggingface daily},
  eprint = {2606.26087},
  archiveprefix = {arXiv},
}
