Evaluating medical AI systems in dermatology under uncertain ground truth.

Journal: Medical Image Analysis

Abstract

For safety, medical AI systems undergo thorough evaluation before deployment, validating their predictions against a ground truth that is assumed to be fixed and certain. In medical applications, however, this ground truth is often curated in the form of differential diagnoses provided by multiple experts. While a single differential diagnosis reflects the uncertainty in one expert's assessment, multiple experts introduce another layer of uncertainty through potential disagreement. Both forms of uncertainty are ignored in standard evaluation, which aggregates these differential diagnoses into a single label. In this paper, we show that ignoring uncertainty leads to overly optimistic estimates of model performance and therefore underestimates the risk associated with particular diagnostic decisions. Moreover, point estimates largely ignore dramatic differences in the uncertainty of individual cases. To address this, we propose a statistical aggregation approach in which we infer a distribution over the probabilities of the underlying candidate medical conditions themselves, based on the observed annotations. This formulation naturally accounts for potential disagreement between experts, as well as the uncertainty stemming from individual differential diagnoses, capturing the entire ground truth uncertainty. Practically, our approach boils down to generating multiple samples of medical condition probabilities, then evaluating and averaging performance metrics based on these sampled probabilities instead of relying on a single point estimate. This allows us to provide uncertainty-adjusted estimates of common metrics of interest, such as top-k accuracy and average overlap. On the skin condition classification problem of Liu et al. (2020), our methodology reveals significant ground truth uncertainty for most data points and demonstrates that standard evaluation techniques can overestimate performance by several percentage points.
We conclude that, while assuming a crisp ground truth may be acceptable for many AI applications, medical diagnosis calls for a more nuanced evaluation protocol that acknowledges the inherent complexity and variability of differential diagnoses.
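The sampling-and-averaging scheme described in the abstract can be sketched as follows. This is an illustrative simplification, not the paper's actual aggregation model: here each case's plausible ground-truth condition distribution is drawn from a Dirichlet posterior over per-case expert annotation counts (the function name, the Dirichlet choice, and the flat prior are assumptions for the sketch), and top-k accuracy is evaluated per sample and averaged.

```python
import numpy as np


def uncertainty_adjusted_topk(model_probs, annotation_counts, k=3,
                              n_samples=1000, seed=0):
    """Uncertainty-adjusted top-k accuracy (illustrative sketch).

    model_probs: per-case model probabilities over conditions.
    annotation_counts: per-case counts of how often experts named
        each condition, standing in for aggregated differential
        diagnoses (a simplifying assumption).
    """
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_samples):
        correct = 0
        for probs, counts in zip(model_probs, annotation_counts):
            # Sample one plausible ground-truth condition distribution
            # from a Dirichlet posterior with a flat prior (+1.0).
            gt = rng.dirichlet(np.asarray(counts, dtype=float) + 1.0)
            label = int(np.argmax(gt))
            # Model's top-k predicted conditions for this case.
            topk = np.argsort(probs)[::-1][:k]
            correct += int(label in topk)
        accs.append(correct / len(model_probs))
    # Average the metric over sampled ground truths instead of
    # reporting a single point estimate.
    return float(np.mean(accs))
```

With strongly peaked annotation counts the sampled ground truth rarely deviates from the majority label and the adjusted metric stays close to the standard one; with diffuse counts, reflecting expert disagreement, the adjusted metric drops, which is precisely the overestimation effect the abstract describes.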

Authors

  • David Stutz
    Google DeepMind, United Kingdom. Electronic address: dstutz@google.com.
  • Ali Taylan Cemgil
    Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey.
  • Abhijit Guha Roy
    Department of Electrical Engineering, Indian Institute of Technology Kharagpur, West Bengal, India.
  • Tatiana Matejovicova
    Google DeepMind, United Kingdom.
  • Melih Barsbey
Boğaziçi University, Istanbul, Turkey.
  • Patricia Strachan
    Google Research, Mountain View, CA, USA.
  • Mike Schaekermann
    Google Health, Google LLC, Mountain View, California.
  • Jan Freyberg
    Google Research, Mountain View, CA, USA.
  • Rajeev Rikhye
    Google, United States.
  • Beverly Freeman
    Google, United States.
  • Javier Perez Matos
    Google, United States.
  • Umesh Telang
    Google Research, Mountain View, CA, USA.
  • Dale R Webster
    Google Inc, Mountain View, California.
  • Yuan Liu
    Department of General Surgery, Wuxi People's Hospital Affiliated to Nanjing Medical University, Wuxi, China.
  • Greg S Corrado
Google Health, Palo Alto, CA, USA.
  • Yossi Matias
    Google Research, Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA, USA.
  • Pushmeet Kohli
    DeepMind, London, UK.
  • Yun Liu
Google Health, Palo Alto, CA, USA.
  • Arnaud Doucet
    Google DeepMind, United Kingdom.
  • Alan Karthikesalingam
    Department of Outcomes Research, St George's Vascular Institute, London, SW17 0QT, United Kingdom.