DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation
Journal: arXiv
Published Date: Jun 26, 2025
Abstract
Commercial RGB-D cameras often produce noisy, incomplete depth maps for
non-Lambertian objects. Traditional depth completion methods struggle to
generalize due to the limited diversity and scale of training data. Recent
advances exploit visual priors from pre-trained text-to-image diffusion models
to enhance generalization in dense prediction tasks. However, we find that
biases arising from training-inference mismatches in the vanilla diffusion
framework significantly impair depth completion performance. Additionally, the
lack of distinct visual features in non-Lambertian regions further hinders
precise prediction. To address these issues, we propose DidSee, a
diffusion-based framework for depth completion on non-Lambertian objects.
First, we integrate a rescaled noise scheduler enforcing a zero terminal
signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a
noise-agnostic single-step training formulation to alleviate error accumulation
caused by exposure bias and optimize the model with a task-specific loss.
Finally, we incorporate a semantic enhancer that enables joint depth completion
and semantic segmentation, distinguishing objects from backgrounds and yielding
precise, fine-grained depth maps. DidSee achieves state-of-the-art performance
on multiple benchmarks, demonstrates robust real-world generalization, and
effectively improves downstream tasks such as category-level pose estimation
and robotic grasping.
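
To make the first component concrete, the following is a minimal sketch of how a noise schedule can be rescaled so that its terminal signal-to-noise ratio is exactly zero. The abstract does not give DidSee's exact scheduler, so this follows the rescaling commonly used in prior work on zero-terminal-SNR schedules and should be read as an illustration, not the authors' implementation. The linear beta schedule, the 1000-step length, and the function name `rescale_zero_terminal_snr` are assumptions chosen for the example.

```python
import numpy as np

def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so the final step carries no signal (SNR = 0).

    Hypothetical helper for illustration; not taken from the DidSee code.
    """
    alphas_bar = np.cumprod(1.0 - betas)   # cumulative signal retention
    sqrt_ab = np.sqrt(alphas_bar)

    first, last = sqrt_ab[0], sqrt_ab[-1]
    # Shift so the last value becomes 0, then scale so the first is unchanged.
    sqrt_ab = (sqrt_ab - last) * first / (first - last)

    # Convert the rescaled cumulative products back to per-step betas.
    alphas_bar = sqrt_ab ** 2
    alphas = np.concatenate([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

# Assumed example schedule: linear betas over 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
new_betas = rescale_zero_terminal_snr(betas)

old_ab = np.cumprod(1.0 - betas)
new_ab = np.cumprod(1.0 - new_betas)

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t); the original schedule leaks signal.
old_terminal_snr = old_ab[-1] / (1.0 - old_ab[-1])
new_terminal_snr = new_ab[-1] / (1.0 - new_ab[-1])
print("terminal SNR before rescaling:", old_terminal_snr)
print("terminal SNR after rescaling: ", new_terminal_snr)
```

With the unrescaled schedule the last timestep still retains a small fraction of the clean signal, which is the mismatch the abstract calls signal leakage bias: the model sees slightly informative inputs at the final training timestep but starts from pure noise at inference. After rescaling, the final step is pure noise in both settings.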