MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Yagmur Akarken, Orest Kupyn, Christian Rupprecht

Hugging Face score: 4.5

Published 2026-06-15 · First seen 2026-06-16

General AI

Abstract

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
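The abstract's central idea, fusing features from several denoising timesteps with weights that vary per spatial location, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the nested-list tensor layout, and the choice of a softmax over timesteps to normalise the weights are all assumptions made for illustration, and the `logits` stand in for the output of a hypothetical learned weighting head. A real system would use a tensor library rather than plain Python lists.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_timesteps(features, logits):
    """Fuse per-timestep feature maps using per-pixel weights.

    features: features[t][y][x] is a list of C floats (shape T x H x W x C).
    logits:   logits[t][y][x] is a float score (shape T x H x W); assumed to
              come from a small learned head, which is hypothetical here.
    Returns an H x W x C fused feature map.
    """
    num_steps = len(features)
    height = len(features[0])
    width = len(features[0][0])
    channels = len(features[0][0][0])

    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            # Normalise over timesteps separately at each pixel, so different
            # locations can favour different points on the trajectory.
            weights = softmax([logits[t][y][x] for t in range(num_steps)])
            row.append([
                sum(weights[t] * features[t][y][x][c] for t in range(num_steps))
                for c in range(channels)
            ])
        fused.append(row)
    return fused


# Toy input: 2 timesteps, a 1 x 2 image, 1 channel.
features = [
    [[[1.0], [1.0]]],  # features from an early timestep
    [[[3.0], [3.0]]],  # features from a late timestep
]
logits = [
    [[0.0, 0.0]],
    [[0.0, 50.0]],  # the right-hand pixel strongly prefers the late timestep
]
fused = fuse_timesteps(features, logits)
```

In the toy input, the left pixel has equal scores for both timesteps and so receives their average, while the right pixel is dominated by the late timestep. Single-timestep extraction corresponds to the special case where one timestep receives all the weight at every pixel.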

BibTeX

@misc{akarken2026mmdiff,
  title = {MMDiff: Extending Diffusion Transformers for Multi-Modal Generation},
  author = {Yagmur Akarken and Orest Kupyn and Christian Rupprecht},
  year = {2026},
  url = {https://huggingface.co/papers/2606.16673},
  keywords = {diffusion transformers, denoising trajectory, multi-modal generative system, lightweight decoder heads, multi-timestep feature fusion, spatially varying aggregation weights, semantic segmentation, salient object detection, depth estimation, concept-driven attention extraction, DINOv3, synthetic data generation, huggingface daily},
  eprint = {2606.16673},
  archiveprefix = {arXiv},
}
