Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs
Journal: arXiv
Published Date: Jun 17, 2025
Abstract
Machine learning-based decision support systems are increasingly deployed in
clinical settings, where probabilistic scoring functions are used to inform and
prioritize patient management decisions. However, widely used scoring rules,
such as accuracy and AUC-ROC, fail to adequately reflect key clinical
priorities, including calibration, robustness to distributional shifts, and
sensitivity to asymmetric error costs. In this work, we propose a principled
yet practical evaluation framework for selecting calibrated thresholded
classifiers that explicitly accounts for the uncertainty in class prevalences
and domain-specific cost asymmetries often found in clinical settings. Building
on the theory of proper scoring rules, particularly the Schervish
representation, we derive an adjusted variant of cross-entropy (log score) that
averages cost-weighted performance over clinically relevant ranges of class
balance. The resulting evaluation is simple to apply, sensitive to clinical
deployment conditions, and designed to prioritize models that are both
calibrated and robust to real-world variations.
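
As a rough illustration of the idea, and not the authors' exact formulation, the sketch below uses the Schervish representation: a proper scoring rule is written as a weighted integral of cost-weighted misclassification losses over decision thresholds c, and the weight 1/(c(1-c)) over all of (0, 1) recovers the log score. Restricting the integral to an interval [lo, hi] gives a bounded, clipped variant of cross-entropy. The function names, the interval endpoints, and the reading of [lo, hi] as the range of thresholds implied by plausible class balances and cost ratios are all assumptions made for this example.

```python
import numpy as np

def cost_weighted_loss(y, p, c):
    """Loss of the rule 'predict positive when p > c'.

    A false positive costs c and a false negative costs 1 - c, so c is
    the threshold at which a calibrated score is indifferent between actions.
    """
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    predict_pos = p > c
    return np.where(y == 1, (1 - c) * (~predict_pos), c * predict_pos)

def bounded_log_score(y, p, lo, hi):
    """Mean of the Schervish integral restricted to thresholds in [lo, hi].

    Closed form of: integral over c in [lo, hi] of
    cost_weighted_loss(y, p, c) / (c * (1 - c)) dc.
    lo and hi are hypothetical bounds chosen by the analyst.
    """
    y = np.asarray(y)
    q = np.clip(np.asarray(p, dtype=float), lo, hi)
    loss_pos = np.log(hi) - np.log(q)          # positives: integral of 1/c
    loss_neg = np.log(1 - lo) - np.log(1 - q)  # negatives: integral of 1/(1-c)
    return np.where(y == 1, loss_pos, loss_neg).mean()

# Toy labels and predicted probabilities (made up for illustration).
y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.3, 0.6])
print(bounded_log_score(y, p, lo=0.05, hi=0.5))
```

Predictions outside [lo, hi] are clipped, so a score is penalized only for errors at thresholds that could matter under the assumed deployment conditions. Letting lo approach 0 and hi approach 1 recovers ordinary cross-entropy.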