Semantic discrete decoder based on adaptive pixel clustering for monocular depth estimation.
Journal:
Neural networks : the official journal of the International Neural Network Society
Published Date:
Sep 1, 2025
Abstract
Monocular depth estimation (MDE) has long been a popular and challenging task. Mainstream methods currently fall into two groups: regression methods based on geometric constraints and ordinal regression methods based on discretized depth intervals. Both overlook the fact that depth values within an object tend to be continuous, while depth values between objects are discontinuous to varying degrees. Based on this observation, we propose a more general approach to monocular depth estimation called APCDepth. It treats MDE as a continuous regression task rather than an ordinal regression task, which preserves the continuity of depth values within objects. To capture the discontinuity of depth values between objects, we propose an Adaptive Pixel Clustering (APC) module that semantically discretizes the deep encoder features, and a Cross-Semantic Alignment (CSA) module that aligns the discretized feature maps to a higher resolution. Additionally, to avoid the quadratic complexity that Transformer decoders introduce in depth estimation, we propose a Deformable Feature Pyramid Network (DeFPN) with sparse attention for multi-scale feature fusion. Experimental results on the KITTI and NYU datasets validate the effectiveness of APCDepth and demonstrate strong performance.
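The idea of semantically discretizing per-pixel features can be illustrated with a minimal sketch. The abstract does not describe how the APC module works internally, so the code below is not the authors' method: it assumes plain hard k-means with a fixed number of clusters, and the names `cluster_pixels` and `discretize` are hypothetical. It only shows how replacing each pixel's feature vector with its cluster center yields a feature map that is constant within a group and changes abruptly between groups.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_pixels(
    features: List[List[Vector]],
    initial_centers: List[Vector],
    iterations: int = 5,
) -> Tuple[List[List[int]], List[List[float]]]:
    """Hard k-means over an H x W grid of feature vectors.

    Assumption: the number of clusters is fixed by `initial_centers`;
    an adaptive scheme would choose or refine it from the data.
    Returns the per-pixel label map and the final cluster centers.
    """
    height, width = len(features), len(features[0])
    centers = [list(c) for c in initial_centers]
    labels = [[0] * width for _ in range(height)]
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest center.
        for r in range(height):
            for c in range(width):
                labels[r][c] = min(
                    range(len(centers)),
                    key=lambda k: squared_distance(features[r][c], centers[k]),
                )
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [
                features[r][c]
                for r in range(height)
                for c in range(width)
                if labels[r][c] == k
            ]
            if members:  # keep the old center if a cluster is empty
                dim = len(centers[k])
                centers[k] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return labels, centers


def discretize(
    labels: List[List[int]], centers: List[List[float]]
) -> List[List[List[float]]]:
    """Replace every pixel's feature with the center of its cluster."""
    return [[list(centers[k]) for k in row] for row in labels]


# Toy 2 x 4 feature map: the left half and right half mimic two objects.
features = [
    [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]],
    [[0.1, 0.1], [0.1, -0.1], [0.9, 0.9], [0.9, 1.1]],
]
labels, centers = cluster_pixels(features, [[0.0, 0.0], [1.0, 1.0]])
discretized = discretize(labels, centers)
```

In this toy input the two halves of the grid end up in separate clusters, so `discretized` is uniform inside each half and jumps at the boundary between them. A continuous regression head applied on top of such features could still predict smoothly varying depth within each group.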