Human-like monocular depth biases in deep neural networks.

Journal: PLoS Computational Biology

Abstract

Human depth perception from 2D images is systematically distorted, yet the nature of these distortions is not fully understood. By examining error patterns in depth estimation for both humans and deep neural networks (DNNs), which have shown remarkable abilities in monocular depth estimation, we can gain insights into how to construct functional models of human 3D vision and design artificial models with improved interpretability. Here, we propose a comprehensive human-DNN comparison framework for a monocular depth judgment task. Using a novel human-annotated dataset of natural indoor scenes and a systematic analysis of absolute depth judgments, we investigate error patterns in both humans and DNNs. Employing exponential-affine fitting, we decompose depth estimation errors into depth compression, per-image affine transformations (including scaling, shearing, and translation), and residual errors. Our analysis reveals that human depth judgments exhibit systematic and consistent biases, including depth compression, a vertical bias (perceiving objects in the lower visual field as closer), and consistent per-image affine distortions across participants. Intriguingly, we find that DNNs with higher accuracy partially recapitulate these human biases, demonstrating greater similarity in affine parameters and residual error patterns. This suggests that these seemingly suboptimal human biases may reflect efficient, ecologically adapted strategies for depth inference from inherently ambiguous monocular images. However, while DNNs capture metric-level residual error patterns similar to humans, they fail to reproduce human-level accuracy in ordinal depth perception within the affine-invariant space. These findings underscore the importance of evaluating error patterns beyond raw accuracy, providing new insights into how humans and computational models resolve depth ambiguity.
Our dataset and methodology provide a framework for evaluating the alignment between computational models and human perceptual biases, thereby advancing our understanding of visual space representation and guiding the development of models that more faithfully capture human depth perception.
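The exponential-affine decomposition described above can be sketched as a simple fitting procedure. The sketch below is illustrative only, not the authors' published code: it assumes depth compression is modeled as a power law on estimated depth, and that the per-image affine component comprises a depth scale, shears along the image axes, and a translation. The function name, parameter grid, and model form are assumptions for illustration.

```python
import numpy as np

def fit_exponential_affine(d_est, d_true, xs, ys,
                           alphas=np.linspace(0.2, 1.5, 27)):
    """Fit d_true ~ s * d_est**alpha + hx*x + hy*y + t per image.

    The compression exponent alpha is found by grid search; for each
    candidate alpha, the affine parameters (scale s, shears hx/hy,
    translation t) are solved in closed form by linear least squares.
    Returns the best-fitting parameters and the residual errors.
    """
    best = None
    for alpha in alphas:
        # Design matrix: compressed depth, image coordinates, constant.
        A = np.stack([d_est ** alpha, xs, ys, np.ones_like(xs)], axis=1)
        coef, *_ = np.linalg.lstsq(A, d_true, rcond=None)
        resid = d_true - A @ coef
        sse = float(resid @ resid)
        if best is None or sse < best[0]:
            best = (sse, alpha, coef, resid)
    sse, alpha, (s, hx, hy, t), resid = best
    return {"alpha": float(alpha), "scale": float(s),
            "shear_x": float(hx), "shear_y": float(hy),
            "translation": float(t), "residuals": resid}
```

On synthetic data generated from this same model, the procedure recovers the compression exponent and affine parameters exactly; on real judgments, the residuals left over after removing compression and the per-image affine fit are what the paper compares between humans and DNNs.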

Authors

  • Yuki Kubota
    Communication Science Laboratories, NTT, Inc., Kanagawa, Japan.
  • Taiki Fukiage
    Communication Science Laboratories, NTT, Inc., Kanagawa, Japan.
