Paper Detail

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen, Jingdong Wang, Xinchao Wang, Xiaojuan Qi, Shijian Lu, Bin Wang

arXiv · Score 15.2

Published 2026-04-30 · First seen 2026-05-01

General AI

Abstract

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.
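As a minimal sketch of the enrichment step described above, the snippet below merges note fields into one record of a JSON Lines document. Only the four field names and the `papers.jsonl` filename come from this page. The `id` key, the use of the arXiv identifier as its value, the string-valued notes, and the helper name `enrich_jsonl` are all assumptions made for illustration; the real schema may differ.

```python
import json

# Field names taken from the hint above; their value types are an assumption.
NOTE_FIELDS = {"summary_sections", "why_relevant", "claim_impact", "next_action"}


def enrich_jsonl(jsonl_text, paper_id, notes, id_key="id"):
    """Return jsonl_text with `notes` merged into the record whose id_key equals paper_id.

    `id_key` is hypothetical: the actual key used in papers.jsonl is not shown on this page.
    """
    unknown = set(notes) - NOTE_FIELDS
    if unknown:
        raise ValueError(f"unexpected note fields: {sorted(unknown)}")

    out_lines = []
    found = False
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get(id_key) == paper_id:
            record.update(notes)
            found = True
        out_lines.append(json.dumps(record, ensure_ascii=False))

    if not found:
        raise KeyError(paper_id)
    return "\n".join(out_lines) + "\n"


# Usage with an in-memory sample record (placeholder content, not real file data).
sample = '{"id": "2604.28185", "title": "Visual Generation in the New Era"}\n'
updated = enrich_jsonl(sample, "2604.28185", {"next_action": "read the abstract"})
print(updated)
```

To apply this to a file, one could read `papers.jsonl` into a string, pass it through the helper, and write the result back. Working on text rather than on the file directly keeps the sketch easy to test.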


BibTeX

@article{wu2026visual,
  title = {Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling},
  author = {Keming Wu and Zuhao Yang and Kaichen Zhang and Shizun Wang and Haowei Zhu and Sicong Leng and Zhongyu Yang and Qijie Wang and Sudong Wang and Ziting Wang and Zili Wang and Hui Zhang and Haonan Wang and Hang Zhou and Yifan Pu and Xingxuan Li and Fangneng Zhan and Bo Li and Lidong Bing and Yuxin Song and Ziwei Liu and Wenhu Chen and Jingdong Wang and Xinchao Wang and Xiaojuan Qi and Shijian Lu and Bin Wang},
  year = {2026},
  abstract = {Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators.},
  url = {https://arxiv.org/abs/2604.28185},
  keywords = {cs.CV, visual generation models, photorealism, spatial reasoning, long-horizon consistency, causal understanding, flow matching, unified understanding-and-generation models, visual representations, post-training, reward modeling, data curation, synthetic data distillation, sampling acceleration, benchmark review, stress tests, case studies, huggingface daily},
  eprint = {2604.28185},
  archiveprefix = {arXiv},
}
