Paper Detail

ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Zhengwentai Sun, Keru Zheng, Chenghong Li, Hongjie Liao, Xihe Yang, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han

huggingface Score 4.5

Published 2026-04-21 · First seen 2026-04-23

General AI

Abstract

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: later
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.
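
As a rough illustration of the enrichment step described above, the sketch below merges note fields into one record of a JSON Lines file. Only the field names (`summary_sections`, `why_relevant`, `claim_impact`, `next_action`) and the file name `papers.jsonl` come from this page. The helper `enrich_record`, the use of `eprint` as the lookup key, and the value types of the fields are assumptions made for illustration, not the tool's documented schema.

```python
import json
from pathlib import Path


def enrich_record(path, paper_id, notes, id_field="eprint"):
    """Merge `notes` into every JSONL record whose `id_field` equals `paper_id`.

    Hypothetical helper: the real layout of papers.jsonl is not shown on this
    page, so matching on `eprint` is an assumption.
    Returns the number of records that were updated.
    """
    path = Path(path)
    updated = 0
    out_lines = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get(id_field) == paper_id:
            record.update(notes)  # add or overwrite the note fields
            updated += 1
        out_lines.append(json.dumps(record, ensure_ascii=False))
    # Rewrite the file with one JSON object per line.
    path.write_text("\n".join(out_lines) + "\n", encoding="utf-8")
    return updated
```

Under these assumptions, a call such as `enrich_record("papers.jsonl", "2604.19720", {"next_action": "..."})` would update only the matching record and leave every other line unchanged apart from re-serialization.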

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@misc{sun2026reimagine,
  title = {ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis},
  author = {Zhengwentai Sun and Keru Zheng and Chenghong Li and Hongjie Liao and Xihe Yang and Heyuan Li and Yihao Zhi and Shuliang Ning and Shuguang Cui and Xiaoguang Han},
  year = {2026},
  abstract = {Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consi},
  url = {https://huggingface.co/papers/2604.19720},
  keywords = {image generation, video diffusion models, SMPL-X, temporal refinement, canonical human dataset, compositional human image synthesis, code available, huggingface daily},
  eprint = {2604.19720},
  archiveprefix = {arXiv},
}
