Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng

huggingface Score 15.5

Published 2026-03-31 · First seen 2026-04-01

General AI

Abstract

Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.
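
The four stages named in the abstract (prompt understanding, multimodal evidence searching, grounded recaptioning, final synthesis) can be read as a simple sequential control flow. The sketch below is only an illustration of that flow, not the authors' implementation: every name in it (`Evidence`, `Trajectory`, `understand_prompt`, `search_evidence`, `recaption`, `run_pipeline`, and the `search_fn` / `generate_fn` callables) is hypothetical, and each stage is a trivial stub where the paper presumably uses a single unified model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """One retrieved item (hypothetical type, not from the paper)."""
    kind: str     # "text" or "image"
    content: str  # snippet text, or a path/URL for an image


@dataclass
class Trajectory:
    """Record of one pass through the four stages named in the abstract."""
    prompt: str
    queries: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)
    grounded_caption: str = ""
    image: str = ""


def understand_prompt(prompt: str) -> List[str]:
    # Stage 1 (stub): decide what to look up. Assumption: one text query
    # and one image query per prompt.
    return [f"facts about {prompt}", f"reference images of {prompt}"]


def search_evidence(
    queries: List[str], search_fn: Callable[[str], List[Evidence]]
) -> List[Evidence]:
    # Stage 2: send each query to a caller-supplied search backend.
    evidence: List[Evidence] = []
    for query in queries:
        evidence.extend(search_fn(query))
    return evidence


def recaption(prompt: str, evidence: List[Evidence]) -> str:
    # Stage 3 (stub): fold retrieved text facts into the caption.
    facts = [e.content for e in evidence if e.kind == "text"]
    if not facts:
        return prompt
    return f"{prompt}. Grounding: " + "; ".join(facts)


def run_pipeline(
    prompt: str,
    search_fn: Callable[[str], List[Evidence]],
    generate_fn: Callable[[str, List[str]], str],
) -> Trajectory:
    traj = Trajectory(prompt=prompt)
    traj.queries = understand_prompt(prompt)
    traj.evidence = search_evidence(traj.queries, search_fn)
    traj.grounded_caption = recaption(prompt, traj.evidence)
    # Stage 4: synthesis, conditioned on the caption and any reference images.
    reference_images = [e.content for e in traj.evidence if e.kind == "image"]
    traj.image = generate_fn(traj.grounded_caption, reference_images)
    return traj
```

Passing the search backend and the generator in as callables keeps the sketch runnable without any real retrieval service or image model. The returned `Trajectory` simply records each intermediate result and is not meant to reflect the format of the paper's 143K training trajectories.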

BibTeX

@misc{chen2026unify,
  title = {Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
  author = {Shuang Chen and Quanxin Shou and Hangting Chen and Yucheng Zhou and Kaituo Feng and Wenbo Hu and Yi-Fan Zhang and Yunlong Lin and Wenxuan Huang and Mingyang Song and Dasen Dai and Bolin Jiang and Manyuan Zhang and Shi-Xue Zhang and Zhengkai Jiang and Lucas Wang and Zhao Zhong and Yu Cheng and Nanyun Peng},
  year = {2026},
  abstract = {Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agen},
  url = {https://huggingface.co/papers/2603.29620},
  keywords = {unified multimodal models, agentic modeling, world-grounded image synthesis, prompt understanding, multimodal evidence searching, grounded recaptioning, agent trajectories, FactIP, closed-source models, code available, huggingface daily},
  eprint = {2603.29620},
  archiveprefix = {arXiv},
}
