Paper Detail
Zhiqi Li, Chengrui Dong, Zhenhua Du, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Dongxu Wei, Peidong Liu
Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS). This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history. By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.
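The factorization described in the abstract can be illustrated with a toy rollout loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `STATE_LEN`, `transition`, `render`, and `rollout` are hypothetical names, the state is a plain list of floats, and camera poses are reduced to single scalars. The noisy update merely stands in for the diffusion transformer, and the averaging function stands in for pose-conditioned rendering. The only point is that the rollout variable keeps a fixed length while frames are derived from it deterministically.

```python
import random

STATE_LEN = 8  # hypothetical fixed length of the scene state (assumption)


def transition(state, camera_pose, rng):
    """Stochastic stand-in for the state transition.

    Returns a new state with the same length as the input; the Gaussian
    noise is only a placeholder for sampling from a learned model.
    """
    return [0.9 * s + 0.1 * camera_pose + rng.gauss(0.0, 0.01) for s in state]


def render(state, camera_pose):
    """Deterministic stand-in for pose-conditioned rendering.

    The same (state, pose) pair always produces the same 'frame',
    here collapsed to a single float for illustration.
    """
    return sum(state) / len(state) + camera_pose


def rollout(initial_state, trajectory, seed=0):
    """Evolve the fixed-length state along a camera trajectory.

    Only the current state is carried between steps; rendered frames are
    collected as outputs and never fed back into the transition.
    """
    rng = random.Random(seed)
    state = list(initial_state)
    frames = []
    for pose in trajectory:
        state = transition(state, pose, rng)  # stochastic step
        frames.append(render(state, pose))    # deterministic step
    return state, frames


if __name__ == "__main__":
    final_state, frames = rollout([0.0] * STATE_LEN, [0.1, 0.2, 0.3])
    print(len(final_state), len(frames))
```

In this sketch the memory carried across steps stays at `STATE_LEN` values regardless of trajectory length, in contrast to a rollout that appends frame latents at every step. The geometry-aware history retrieval mentioned in the abstract is not modeled here.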
@misc{li2026walking,
title = {Walking in the Implicit: Interactive World Exploration via Neural Scene Representation},
author = {Zhiqi Li and Chengrui Dong and Zhenhua Du and Hangning Zhou and Cong Qiu and Hailong Qin and Mu Yang and Dongxu Wei and Peidong Liu},
year = {2026},
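abstract = {Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS). This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history. By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.},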
url = {https://huggingface.co/papers/2606.30045},
keywords = {latent video frames, implicit state, Neural Implicit Scene, transformer VAE, diffusion transformer, pose-conditioned rendering, camera trajectories, geometry-aware retrieval, VAE encoder, unified conditioner, long-horizon consistency, code available, huggingface daily},
eprint = {2606.30045},
archiveprefix = {arXiv},
}