Paper Detail

Masked Generative Transformer Is What You Need for Image Editing

Wei Chow, Linfeng Li, Xian Sun, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu

arxiv Score 11.3

Published 2026-05-11 · First seen 2026-05-13

Research Track A · General AI

Abstract

Diffusion models dominate image editing, yet their global denoising mechanism entangles edited regions with surrounding context, causing modifications to propagate into areas that should remain intact. We propose a fundamentally different approach by leveraging Masked Generative Transformers (MGTs), whose localized token-prediction paradigm naturally confines changes to intended regions. We present EditMGT, an MGT-based editing framework that is the first of its kind. Our approach employs multi-layer attention consolidation to aggregate cross-attention maps into precise edit localization signals, and region-hold sampling to explicitly prevent token flipping in non-target areas. To support training, we construct CrispEdit-2M, a 2M-sample high-resolution (>1024) editing dataset spanning seven categories. With only 960M parameters, EditMGT achieves state-of-the-art image similarity on multiple benchmarks while delivering 6x faster editing, demonstrating that MGTs offer a compelling alternative to diffusion-based editing.
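
The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `consolidate_attention` and `region_hold`, the plain mean over layers, the min-max normalization, and the 0.5 threshold are all assumptions made for illustration. The sketch only shows the general idea of turning per-layer attention into an edit mask and then keeping source tokens wherever the mask is off.

```python
import numpy as np

def consolidate_attention(attn_maps, threshold=0.5):
    """Hypothetical consolidation: average per-layer attention into one edit mask.

    attn_maps: array of shape (num_layers, num_image_tokens) holding the
    attention mass each image token assigns to the edit instruction.
    Returns a boolean mask that is True where the edit is assumed to apply.
    """
    mean_map = attn_maps.mean(axis=0)                # aggregate across layers
    lo, hi = mean_map.min(), mean_map.max()
    normalized = (mean_map - lo) / (hi - lo + 1e-8)  # rescale to roughly [0, 1]
    return normalized >= threshold

def region_hold(predicted_tokens, source_tokens, edit_mask):
    """Hypothetical region-hold step: keep source tokens outside the edit mask."""
    return np.where(edit_mask, predicted_tokens, source_tokens)

# Toy example: 2 layers, 4 image tokens (all values are made up).
attn = np.array([[0.1, 0.9, 0.2, 0.8],
                 [0.0, 1.0, 0.1, 0.7]])
mask = consolidate_attention(attn)
source = np.array([10, 11, 12, 13])     # token ids of the source image
predicted = np.array([20, 21, 22, 23])  # token ids proposed by the model
edited = region_hold(predicted, source, mask)
```

In this toy setting only the tokens with high consolidated attention take the model's prediction; every other position is held at its source value, which is the behavior the abstract describes as preventing token flipping in non-target areas.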

Workflow Status

Review status: pending
Role: unreviewed
Read priority: soon
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{chow2026masked,
  title = {Masked Generative Transformer Is What You Need for Image Editing},
  author = {Wei Chow and Linfeng Li and Xian Sun and Lingdong Kong and Zefeng Li and Qi Xu and Hang Song and Tian Ye and Xian Wang and Jinbin Bai and Shilin Xu and Xiangtai Li and Junting Pan and Shaoteng Liu and Ran Zhou and Tianshu Yang and Songhua Liu},
  year = {2026},
  abstract = {Diffusion models dominate image editing, yet their global denoising mechanism entangles edited regions with surrounding context, causing modifications to propagate into areas that should remain intact. We propose a fundamentally different approach by leveraging Masked Generative Transformers (MGTs), whose localized token-prediction paradigm naturally confines changes to intended regions. We present EditMGT, an MGT-based editing framework that is the first of its kind. Our approach employs multi-},
  url = {https://arxiv.org/abs/2605.10859},
  keywords = {cs.CV, cs.LG},
  eprint = {2605.10859},
  archiveprefix = {arXiv},
}

Metadata

{}