Paper Detail

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, Chao Zhang, Coopers Li, Dongyuan Guo, Fan Yang, Haiyu Zhang, Hang Cao, Jianchen Zhu, Jiaxin Lin, Jie Xiao, Jihong Zhang, Junlin Yu, Lei Wang, Lifu Wang, Lilin Wang, Linus, Minghui Chen, Peng He, Penghao Zhao, Qi Chen, Rui Chen, Rui Shao, Sicong Liu, Wangchen Qin, Xiaochuan Niu, Xiang Yuan, Yi Sun, Yifei Tang, Yifu Sun, Yihang Lian, Yonghao Tan, Yuhong Liu, Yuyang Yin, Zhiyuan Min, Tengfei Wang, Chunchao Guo

huggingface Score 9.5

Published 2026-04-15 · First seen 2026-04-17

General AI

Abstract

We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Also, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.
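The routing the abstract describes (text or single-view image goes through the four-stage generation path; multi-view images or video go to reconstruction) can be sketched as a small lookup. This is only an illustration under stated assumptions: the stage and component names come from the abstract, while `Modality`, `Plan`, `plan_for`, and the "World Reconstruction" label are hypothetical names chosen here and are not the released HY-World 2.0 API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    SINGLE_IMAGE = "single_image"
    MULTI_VIEW = "multi_view"
    VIDEO = "video"


# Stage labels and component names are taken from the abstract. Everything
# else in this sketch is hypothetical and not the project's actual code.
GENERATION_STAGES = [
    ("Panorama Generation", "HY-Pano 2.0"),
    ("Trajectory Planning", "WorldNav"),
    ("World Expansion", "WorldStereo 2.0"),
    ("World Composition", "WorldMirror 2.0"),
]
# Assumed label: the abstract only says WorldMirror enables reconstruction.
RECONSTRUCTION_STAGES = [("World Reconstruction", "WorldMirror 2.0")]


@dataclass
class Plan:
    task: str
    stages: list = field(default_factory=list)


def plan_for(modality: Modality) -> Plan:
    """Pick the path the abstract describes for a given input modality."""
    if modality in (Modality.TEXT, Modality.SINGLE_IMAGE):
        return Plan("generation", list(GENERATION_STAGES))
    return Plan("reconstruction", list(RECONSTRUCTION_STAGES))


if __name__ == "__main__":
    for m in Modality:
        p = plan_for(m)
        print(m.value, "->", p.task, [component for _, component in p.stages])
```

The sketch only records which components the abstract associates with each input type; it says nothing about how the stages exchange data, which the abstract does not specify.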

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.
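One way to add those fields is a small script that rewrites the matching record in `papers.jsonl`. This is a minimal sketch and not the tool's own interface: it assumes each line is a JSON object, that records can be matched on a `url` key, and that `summary_sections` holds a list of strings. None of those details are confirmed by this page, and `enrich` is a hypothetical helper name.

```python
import json
from pathlib import Path


def enrich(path, url, notes):
    """Merge `notes` into every JSONL record whose "url" equals `url`.

    Assumption: records carry a "url" key; the real schema may differ.
    Returns the number of records updated.
    """
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    updated = 0
    for rec in records:
        if rec.get("url") == url:
            rec.update(notes)
            updated += 1
    # Write one JSON object per line, preserving record order.
    path.write_text(
        "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records),
        encoding="utf-8",
    )
    return updated


if __name__ == "__main__":
    # Placeholder values; replace them with your own notes.
    count = enrich(
        "papers.jsonl",
        "https://huggingface.co/papers/2604.14268",
        {
            "summary_sections": ["TODO"],
            "why_relevant": "TODO",
            "claim_impact": "TODO",
            "next_action": "TODO",
        },
    )
    print(f"updated {count} record(s)")
```

The script rewrites the whole file, so back up `papers.jsonl` first if it is large or edited by other tools.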

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@misc{hyworld2026hy,
  title = {HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds},
  author = {Team HY-World and Chenjie Cao and Xuhui Zuo and Zhenwei Wang and Yisu Zhang and Junta Wu and Zhenyang Liu and Yuning Gong and Yang Liu and Bo Yuan and Chao Zhang and Coopers Li and Dongyuan Guo and Fan Yang and Haiyu Zhang and Hang Cao and Jianchen Zhu and Jiaxin Lin and Jie Xiao and Jihong Zhang and Junlin Yu and Lei Wang and Lifu Wang and Lilin Wang and Linus and Minghui Chen and Peng He and Penghao Zhao and Qi Chen and Rui Chen and Rui Shao and Sicong Liu and Wangchen Qin and Xiaochuan Niu and Xiang Yuan and Yi Sun and Yifei Tang and Yifu Sun and Yihang Lian and Yonghao Tan and Yuhong Liu and Yuyang Yin and Zhiyuan Min and Tengfei Wang and Chunchao Guo},
  year = {2026},
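  abstract = {We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Also, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.},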
  url = {https://huggingface.co/papers/2604.14268},
  keywords = {multi-modal world model, 3D Gaussian Splatting, HY-Pano 2.0, WorldNav, WorldStereo 2.0, WorldMirror 2.0, keyframe-based view generation, feed-forward model, 3D world representations, interactive exploration, rendering platform, code available, huggingface daily},
  eprint = {2604.14268},
  archiveprefix = {arXiv},
}
