Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

Abstract

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: later
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{cha2026vanast,
  title = {Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision},
  author = {Hyunsoo Cha and Wonjung Woo and Byungjun Kim and Hanbyul Joo},
  year = {2026},
  abstract = {We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis},
  url = {https://arxiv.org/abs/2604.04934},
  keywords = {cs.CV, virtual try-on, pose-driven animation, garment transfer, video diffusion transformers, dual module architecture, triplet supervision, identity preservation, garment accuracy, pose adherence, code available, huggingface daily},
  eprint = {2604.04934},
  archiveprefix = {arXiv},
}

Metadata

{}