Evaluating medical AI systems in dermatology under uncertain ground truth.
Journal:
Medical Image Analysis
Published Date:
Apr 9, 2025
Abstract
For safety, medical AI systems undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed to be fixed and certain. However, in medical applications, this ground truth is often curated in the form of differential diagnoses provided by multiple experts. While a single differential diagnosis reflects the uncertainty in one expert's assessment, multiple experts introduce another layer of uncertainty through potential disagreement. Both forms of uncertainty are ignored in standard evaluation, which aggregates these differential diagnoses into a single label. In this paper, we show that ignoring uncertainty leads to overly optimistic estimates of model performance, thereby underestimating the risk associated with particular diagnostic decisions. Moreover, point estimates largely ignore dramatic differences in the uncertainty of individual cases. To address this, we propose a statistical aggregation approach in which we infer a distribution over the probabilities of the underlying medical condition candidates themselves, based on the observed annotations. This formulation naturally accounts for potential disagreements between experts, as well as uncertainty stemming from individual differential diagnoses, capturing the entire ground truth uncertainty. Practically, our approach boils down to generating multiple samples of medical condition probabilities, then evaluating and averaging performance metrics based on these sampled probabilities, instead of relying on a single point estimate. This allows us to provide uncertainty-adjusted estimates of common metrics of interest such as top-k accuracy and average overlap. In the skin condition classification problem of Liu et al. (2020), our methodology reveals significant ground truth uncertainty for most data points and demonstrates that standard evaluation techniques can overestimate performance by several percentage points.
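The sampling-based evaluation described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual aggregation model: it assumes, for simplicity, a Dirichlet posterior over condition probabilities fitted from hypothetical per-condition expert vote counts, and measures uncertainty-adjusted top-k accuracy by averaging over posterior samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def uncertainty_adjusted_topk(model_probs, annotation_counts, k=3, n_samples=1000):
    """Average top-k accuracy over sampled ground-truth distributions.

    model_probs: (n_cases, n_conditions) model-predicted probabilities.
    annotation_counts: (n_cases, n_conditions) expert vote counts per
        condition (a simplified stand-in for differential diagnoses).
    A Dirichlet posterior with a uniform prior is assumed here; the paper's
    aggregation model is more sophisticated.
    """
    accs = []
    for _ in range(n_samples):
        # Sample one plausible condition-probability vector per case.
        sampled = np.array([rng.dirichlet(c + 1.0) for c in annotation_counts])
        # Treat the most likely condition under the sample as ground truth;
        # this varies across samples, reflecting annotation uncertainty.
        labels = sampled.argmax(axis=1)
        topk = np.argsort(-model_probs, axis=1)[:, :k]
        hits = [labels[i] in topk[i] for i in range(len(labels))]
        accs.append(float(np.mean(hits)))
    # Average the metric over samples instead of using a point estimate.
    return float(np.mean(accs))
```

When experts agree strongly, the sampled labels rarely change and the adjusted metric approaches the standard one; under disagreement, the samples spread over competing conditions and the adjusted accuracy drops accordingly.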
We conclude that, while assuming a crisp ground truth may be acceptable for many AI applications, medical diagnosis calls for a more nuanced evaluation protocol that acknowledges the inherent complexity and variability of differential diagnoses.