Paper Detail

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin

arXiv · Score 12.8

Published 2026-05-12 · First seen 2026-05-13

General AI

Abstract

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

BibTeX

@article{diao2026sensenova,
  title = {SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture},
  author = {Haiwen Diao and Penghao Wu and Hanming Deng and Jiahao Wang and Shihao Bai and Silei Wu and Weichen Fan and Wenjie Ye and Wenwen Tong and Xiangyu Fan and Yan Li and Yubo Wang and Zhijie Cao and Zhiqian Lin and Zhitao Yang and Zhongang Cai and Yuwei Niu and Yue Zhu and Bo Liu and Chengguang Lv and Haojia Yu and Haozhe Xie and Hongli Wang and Jianan Fan and Jiaqi Li and Jiefan Lu and Jingcheng Ni and Junxiang Xu and Kaihuan Liang and Lianqiang Shi and Linjun Dai and Linyan Wang and Oscar Qian and Peng Gao and Pengfei Liu and Qingping Sun and Rui Shen and Ruisi Wang and Shengnan Ma and Shuang Yang and Siyi Xie and Siying Li and Tianbo Zhong and Xiangli Kong and Xuanke Shi and Yang Gao and Yongqiang Yao and Yves Wang and Zhengqi Bai and Zhengyu Lin and Zixin Yin and Wenxiu Sun and Ruihao Gong and Quan Wang and Lewei Lu and Lei Yang and Ziwei Liu and Dahua Lin},
  year = {2026},
  abstract = {Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process.},
  url = {https://arxiv.org/abs/2605.12500},
  keywords = {cs.CV},
  eprint = {2605.12500},
  archiveprefix = {arXiv},
}
