PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

Haofei Xu, Rundi Wu, Philipp Henzler, Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Marc Pollefeys, Andreas Geiger, Federico Tombari, Michael Niemeyer

arxiv Score 6.6

Published 2026-07-02 · First seen 2026-07-03

General AI

Abstract

State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.
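The abstract states that the model operates directly on raw 3D point map patches, but it does not give the patch size, token layout, or any implementation details. As a rough illustration only, the sketch below shows generic ViT-style patchification applied to a point map, where each pixel stores an (x, y, z) coordinate. The function name `patchify`, the patch size, and the row-major token order are all assumptions made for this example, not the authors' method.

```python
def patchify(point_map, patch):
    """Split an H x W grid of (x, y, z) points into flattened patch tokens.

    point_map: list of H rows, each a list of W (x, y, z) tuples.
    Returns (H // patch) * (W // patch) tokens in row-major patch order;
    each token is a flat list of patch * patch * 3 floats.
    Hypothetical helper for illustration; not taken from the paper.
    """
    h, w = len(point_map), len(point_map[0])
    if h % patch or w % patch:
        raise ValueError("height and width must be divisible by patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            token = []
            # Walk the pixels of one patch row by row and append raw coordinates.
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    token.extend(point_map[r][c])
            tokens.append(token)
    return tokens


# Toy 4x4 point map: x = column index, y = row index, z = constant depth 1.0.
toy = [[(float(c), float(r), 1.0) for c in range(4)] for r in range(4)]
tokens = patchify(toy, patch=2)
```

With a 4x4 map and 2x2 patches this yields four tokens of twelve values each. In a pixel-space setup of the kind the abstract describes, such tokens would hold the geometry itself, with no learned tokenizer or latent encoder in between.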

BibTeX

@article{xu2026pointdit,
  title = {PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation},
  author = {Haofei Xu and Rundi Wu and Philipp Henzler and Nikolai Kalischek and Michael Oechsle and Fabian Manhardt and Marc Pollefeys and Andreas Geiger and Federico Tombari and Michael Niemeyer},
  year = {2026},
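  abstract = {State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.},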
  url = {https://arxiv.org/abs/2607.02515},
  keywords = {cs.CV},
  eprint = {2607.02515},
  archiveprefix = {arXiv},
}