RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer

arxiv Score 5.8

Published 2026-05-29 · First seen 2026-06-01

General AI

Abstract

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder
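The abstract's claim of power-law scaling can be made concrete with a small illustration. The sketch below is not the authors' evaluation code and uses no numbers from the paper: the data points are synthetic, and the helper name `fit_power_law` is hypothetical. It only shows how a relationship of the form loss ≈ a · n^b is commonly checked, namely by fitting a straight line in log-log space.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y ~ a * x**b by least squares in log-log space; return (a, b)."""
    # A power law is linear after taking logs: log y = log a + b * log x.
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(slope)


# SYNTHETIC, noise-free points generated from loss = 2.0 * n**-0.3.
# These values are an assumption for illustration, not results from the paper.
n = np.array([1e3, 1e4, 1e5, 1e6])
loss = 2.0 * n ** -0.3

a, b = fit_power_law(n, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On real measurements the points scatter around the fitted line, so the residuals in log-log space are one simple way to judge how closely the scaling follows a power law.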

BibTeX

@article{prestel2026rayder,
  title = {RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video},
  author = {Ulrich Prestel and Stefan Andreas Baumann and Nick Stracke and Björn Ommer},
  year = {2026},
  abstract = {Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state},
  url = {https://arxiv.org/abs/2605.31535},
  keywords = {cs.CV, cs.AI, cs.LG},
  eprint = {2605.31535},
  archiveprefix = {arXiv},
}