Paper Detail

Geo-Align: Video Generation Alignment via Metric Geometry Reward

Zizun Li, Haoyu Guo, Runzhe Teng, Chunhua Shen, Tong He

arxiv Score 6.6

Published 2026-05-22 · First seen 2026-05-25

General AI

Abstract

Camera-controlled video generation has progressed rapidly in recent years. However, existing video-to-video re-rendering methods rely primarily on supervised fine-tuning on synthetic datasets, and synchronized multi-view real-world video data remains extremely scarce. Consequently, the prevailing paradigm often generalizes poorly to out-of-distribution real-world videos, with models struggling to adhere to physical scale and the target camera trajectory. To bridge this gap, we propose Geo-Align, the first reinforcement learning framework designed specifically for camera-controlled video re-rendering. Starting from a pretrained model, we optimize it with a scale-aware perceptual reward. Specifically, we introduce a metric 3D estimator that extracts precise camera trajectories from generated videos, and we explicitly penalize deviations in rotation and translation. Furthermore, we design a data pipeline that combines real-world conditioning videos with target camera trajectories derived from synthetic data, eliminating the reliance on paired data. Extensive experiments demonstrate that Geo-Align consistently outperforms existing supervised learning baselines in both camera controllability and visual fidelity.
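
The abstract describes a reward that penalizes rotation and translation deviations between the camera trajectory recovered from a generated video and the target trajectory. The following is a minimal sketch of that idea, not the paper's actual formulation: the function names, the `(R, t)` pose representation, the use of geodesic rotation angle and Euclidean translation distance, and the weights `w_rot` and `w_trans` are all assumptions made for illustration. The metric 3D estimator itself is not modeled; the sketch assumes its output poses are already available.

```python
import math

def rotation_error_deg(R_pred, R_tgt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T @ R_tgt) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_tgt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def translation_error(t_pred, t_tgt):
    """Euclidean distance between two camera positions, in the same metric unit."""
    return math.dist(t_pred, t_tgt)

def trajectory_reward(poses_pred, poses_tgt, w_rot=1.0, w_trans=1.0):
    """Hypothetical reward: negative weighted mean of per-frame pose errors.

    Each trajectory is a list of (R, t) pairs, one per frame.
    w_rot and w_trans are illustrative weights, not values from the paper.
    """
    n = len(poses_pred)
    rot = sum(rotation_error_deg(rp, rt)
              for (rp, _), (rt, _) in zip(poses_pred, poses_tgt)) / n
    trans = sum(translation_error(tp, tt)
                for (_, tp), (_, tt) in zip(poses_pred, poses_tgt)) / n
    return -(w_rot * rot + w_trans * trans)
```

Under these assumptions, a generated video whose recovered trajectory matches the target exactly receives a reward of zero, and any rotation or translation drift makes the reward more negative. Because translation is compared in metric units, a trajectory with the right shape but the wrong scale is also penalized.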

BibTeX

@article{li2026geo,
  title = {Geo-Align: Video Generation Alignment via Metric Geometry Reward},
  author = {Zizun Li and Haoyu Guo and Runzhe Teng and Chunhua Shen and Tong He},
  year = {2026},
  abstract = {Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on Supervised Fine-Tuning using synthetic datasets. At present, there is an extreme scarcity of synchronized, multi-view real-world video data. Consequently, the prevailing paradigm often exhibits limited generalization when processing out-of-distribution real-world videos, with models struggling to accurately adhere to physical scales and came},
  url = {https://arxiv.org/abs/2605.23903},
  keywords = {cs.CV, Reinforcement Learning, camera-controlled video re-rendering, scale-aware perceptual reward, metric 3D estimator, camera trajectories, supervised fine-tuning, synthetic datasets, real-world video data, pretrained model, data pipeline strategy, code available, huggingface daily},
  eprint = {2605.23903},
  archiveprefix = {arXiv},
}
