Paper Detail

Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng

huggingface Score 16.0

Published 2026-03-31 · First seen 2026-04-01

General AI

Abstract

Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.
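
The four stages named in the abstract (prompt understanding, multimodal evidence searching, grounded recaptioning, final synthesis) can be pictured as a simple sequential loop. The sketch below is only an illustration of that control flow under loud assumptions: every function name, the `Trajectory` record, the capitalized-word heuristic, and the toy retriever are hypothetical placeholders and are not taken from the paper or its code. Each stage is a trivial stand-in for what would be a learned model or a real search tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical record of one pass through the four stages."""
    prompt: str
    concepts: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    grounded_caption: str = ""
    image: str = ""


def understand_prompt(prompt: str) -> List[str]:
    # Placeholder heuristic: treat capitalized words longer than one
    # character as concepts that may need external grounding.
    words = [w.strip(".,") for w in prompt.split()]
    return [w for w in words if len(w) > 1 and w[0].isupper()]


def search_evidence(concepts: List[str],
                    retriever: Callable[[str], List[str]]) -> List[str]:
    # Placeholder: query a caller-supplied retriever once per concept.
    return [hit for concept in concepts for hit in retriever(concept)]


def recaption(prompt: str, evidence: List[str]) -> str:
    # Placeholder: append retrieved facts to the original prompt.
    if not evidence:
        return prompt
    return prompt + " | grounded on: " + "; ".join(evidence)


def synthesize(caption: str) -> str:
    # Placeholder for an image generator; returns a descriptive string.
    return f"<image for: {caption}>"


def run_pipeline(prompt: str,
                 retriever: Callable[[str], List[str]]) -> Trajectory:
    # Run the four stages in order and keep every intermediate result.
    traj = Trajectory(prompt=prompt)
    traj.concepts = understand_prompt(prompt)
    traj.evidence = search_evidence(traj.concepts, retriever)
    traj.grounded_caption = recaption(prompt, traj.evidence)
    traj.image = synthesize(traj.grounded_caption)
    return traj


if __name__ == "__main__":
    # Toy index with an invented concept, used only for demonstration.
    toy_index = {"Zorblat": ["Zorblat: a made-up festival with blue lanterns"]}
    result = run_pipeline("A poster for the Zorblat festival",
                          lambda concept: toy_index.get(concept, []))
    print(result.grounded_caption)
    print(result.image)
```

Keeping the intermediate outputs in one record mirrors the idea of an agent trajectory that can be inspected or supervised stage by stage; how the actual system represents its trajectories is not specified here.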

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@misc{chen2026unify,
  title = {Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
  author = {Shuang Chen and Quanxin Shou and Hangting Chen and Yucheng Zhou and Kaituo Feng and Wenbo Hu and Yi-Fan Zhang and Yunlong Lin and Wenxuan Huang and Mingyang Song and Dasen Dai and Bolin Jiang and Manyuan Zhang and Shi-Xue Zhang and Zhengkai Jiang and Lucas Wang and Zhao Zhong and Yu Cheng and Nanyun Peng},
  year = {2026},
  abstract = {Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agen},
  url = {https://huggingface.co/papers/2603.29620},
  keywords = {unified multimodal models, agentic modeling, world-grounded image synthesis, prompt understanding, multimodal evidence searching, grounded recaptioning, agent trajectories, FactIP, closed-source models, code available, huggingface daily},
  eprint = {2603.29620},
  archiveprefix = {arXiv},
}

Metadata

{}