Research Paper

Pixel‑Perfect Diffusion Transformers for Depth Estimation

Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang...
Published: 2026-01-08 • Added: 2026-01-09

Key Insights

  • Introduces **Pixel‑Perfect Depth (PPD)**, a monocular depth model that operates directly in pixel space using diffusion transformers, eliminating flying pixels and preserving fine scene details.
  • **Semantics‑Prompted DiT** injects high‑level semantic embeddings from large vision foundation models into the diffusion process, guiding global structure while still allowing the model to recover sharp local geometry.
  • The **Cascade DiT** architecture progressively upsamples the token grid (e.g., 1/16 → 1/8 → 1/4 resolution), dramatically reducing computation compared with a full‑resolution diffusion pass while still achieving higher accuracy.
  • Extends to video with **Semantics‑Consistent DiT** that extracts temporally stable semantics from a multi‑view geometry model, and performs **reference‑guided token propagation** to keep depth predictions temporally coherent with minimal overhead.
  • Empirically outperforms all prior generative monocular and video depth methods, producing cleaner point clouds that retain fine‑grained geometry suitable for downstream 3‑D reconstruction tasks.
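The cascade idea above can be made concrete with a quick back-of-the-envelope cost comparison. The sketch below is illustrative only: the image size, patch strides, and the 12/8/4 per-stage block split are assumptions, not values from the paper.

```python
# Hypothetical sketch of the Cascade DiT token schedule. Concrete numbers
# (image size, strides, block split) are illustrative assumptions.

def token_count(h: int, w: int, stride: int) -> int:
    """Number of image tokens for an h x w input at a given patch stride."""
    return (h // stride) * (w // stride)

H, W = 1024, 1024
stages = [16, 8, 4]   # coarse-to-fine token grids: 1/16 -> 1/8 -> 1/4
blocks = [12, 8, 4]   # assumed DiT blocks run at each stage

tokens = [token_count(H, W, s) for s in stages]

# Self-attention cost per block scales roughly with tokens**2.
cascade_cost = sum(b * t ** 2 for b, t in zip(blocks, tokens))
full_cost = sum(blocks) * tokens[-1] ** 2  # every block at the finest grid

print(tokens)                              # [4096, 16384, 65536]
print(round(full_cost / cascade_cost, 1))  # speedup under these assumptions
```

Under this (assumed) block split, running most blocks on the coarse grid cuts the quadratic attention cost by roughly 5x relative to a uniform finest-grid pass, which matches the efficiency motivation stated above.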

Abstract

Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
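The reference-guided token propagation described for the video model (PPVD) can be sketched in a few lines: tokens of the current frame that closely match the reference frame reuse the reference computation, and only the changed tokens pass through the full DiT blocks. This is a minimal NumPy illustration; the similarity measure, threshold, and shapes are assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of reference-guided token propagation (threshold,
# cosine similarity, and tensor shapes are assumptions, not from the paper).

rng = np.random.default_rng(0)
T, D = 64, 32                          # tokens per frame, channel dim

ref_tokens = rng.normal(size=(T, D))   # features from the reference frame
cur_tokens = ref_tokens.copy()
cur_tokens[:8] += rng.normal(scale=2.0, size=(8, D))  # 8 tokens changed

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(-1)

sim = cosine(cur_tokens, ref_tokens)
keep = sim > 0.95        # static tokens: propagate the reference result

# Only the non-matching tokens would be recomputed by the DiT blocks.
n_recompute = int((~keep).sum())
print(n_recompute)       # small relative to T, hence the low overhead
```

Because most tokens in adjacent video frames are near-identical, the recompute set stays small, which is the "minimal computational and memory overhead" claim in the abstract.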

Full Analysis

Source: [arXiv](https://arxiv.org/abs/2601.05246)
Topics: computer-vision, robotics
Difficulty: advanced