Evaluating medical AI systems in dermatology under uncertain ground truth.
Journal:
Medical Image Analysis
Published Date:
Apr 9, 2025
Abstract
For safety, medical AI systems undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed to be fixed and certain. However, in medical applications, this ground truth is often curated in the form of differential diagnoses provided by multiple experts. While a single differential diagnosis reflects the uncertainty in one expert's assessment, multiple experts introduce another layer of uncertainty through potential disagreement. Both forms of uncertainty are ignored in standard evaluation, which aggregates these differential diagnoses into a single label. In this paper, we show that ignoring uncertainty leads to overly optimistic estimates of model performance, thereby underestimating the risk associated with particular diagnostic decisions. Moreover, point estimates largely ignore dramatic differences in the uncertainty of individual cases. To address this, we propose a statistical aggregation approach in which we infer a distribution over the probabilities of the underlying medical condition candidates themselves, based on the observed annotations. This formulation naturally accounts for potential disagreements between experts, as well as uncertainty stemming from individual differential diagnoses, capturing the entire ground truth uncertainty. Practically, our approach boils down to generating multiple samples of medical condition probabilities, then evaluating and averaging performance metrics based on these sampled probabilities, instead of relying on a single point estimate. This allows us to provide uncertainty-adjusted estimates of common metrics of interest such as top-k accuracy and average overlap. In the skin condition classification problem of Liu et al. (2020), our methodology reveals significant ground truth uncertainty for most data points and demonstrates that standard evaluation techniques can overestimate performance by several percentage points.
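The sampling-based evaluation described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual aggregation model: it assumes, for simplicity, a Dirichlet posterior over condition probabilities fitted from hypothetical per-condition expert vote counts, and measures uncertainty-adjusted top-k accuracy by averaging over posterior samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def uncertainty_adjusted_topk(model_probs, annotation_counts, k=3, n_samples=1000):
    """Average top-k accuracy over sampled ground-truth distributions.

    model_probs: (n_cases, n_conditions) model-predicted probabilities.
    annotation_counts: (n_cases, n_conditions) expert vote counts per
        condition (a simplified stand-in for differential diagnoses).
    A Dirichlet posterior with a uniform prior is assumed here; the paper's
    aggregation model is more sophisticated.
    """
    accs = []
    for _ in range(n_samples):
        # Sample one plausible condition-probability vector per case.
        sampled = np.array([rng.dirichlet(c + 1.0) for c in annotation_counts])
        # Treat the most likely condition under the sample as ground truth;
        # this varies across samples, reflecting annotation uncertainty.
        labels = sampled.argmax(axis=1)
        topk = np.argsort(-model_probs, axis=1)[:, :k]
        hits = [labels[i] in topk[i] for i in range(len(labels))]
        accs.append(float(np.mean(hits)))
    # Average the metric over samples instead of using a point estimate.
    return float(np.mean(accs))
```

When experts agree strongly, the sampled labels rarely change and the adjusted metric approaches the standard one; under disagreement, the samples spread over competing conditions and the adjusted accuracy drops accordingly.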
We conclude that, while assuming a crisp ground truth may be acceptable for many AI applications, medical diagnosis calls for a more nuanced evaluation protocol that acknowledges the inherent complexity and variability of differential diagnoses.