Paper Detail
Zhengwentai Sun, Keru Zheng, Chenghong Li, Hongjie Liao, Xihe Yang, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han
Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
@misc{sun2026reimagine,
title = {ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis},
author = {Zhengwentai Sun and Keru Zheng and Chenghong Li and Hongjie Liao and Xihe Yang and Heyuan Li and Yihao Zhi and Shuliang Ning and Shuguang Cui and Xiaoguang Han},
year = {2026},
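abstract = {Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.},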
url = {https://huggingface.co/papers/2604.19720},
keywords = {image generation, video diffusion models, SMPL-X, temporal refinement, canonical human dataset, compositional human image synthesis, code available, huggingface daily},
eprint = {2604.19720},
archiveprefix = {arXiv},
}