ESR Essentials: common performance metrics in AI-practice recommendations by the European Society of Medical Imaging Informatics.
Journal:
European radiology
Published Date:
Aug 3, 2025
Abstract
This article provides radiologists with practical recommendations for evaluating AI performance in radiology, ensuring alignment with clinical goals and patient safety. It outlines key performance metrics, including overlap metrics for segmentation, test-based metrics (e.g., sensitivity, specificity, and area under the receiver operating characteristic curve), and outcome-based metrics (e.g., precision, negative predictive value, F1-score, Matthews correlation coefficient, and area under the precision-recall curve). Key recommendations emphasize local validation using independent datasets, selecting task-specific metrics, and considering deployment context to ensure real-world performance matches claimed efficacy. Common pitfalls, such as overreliance on a single metric, misinterpretation in low-prevalence settings, and failure to account for clinical workflow, are addressed with mitigation strategies. Additional guidance is provided on threshold selection, prevalence-adjusted evaluation, and AI-generated image quality assessment. This guide equips radiologists to critically evaluate both commercially available and in-house developed AI tools, ensuring their safe and effective integration into clinical practice. CLINICAL RELEVANCE STATEMENT: This review provides guidance on selecting and interpreting AI performance metrics in radiology, ensuring clinically meaningful evaluation and safe deployment of AI tools. By addressing common pitfalls and promoting standardized reporting, it supports radiologists in making informed decisions, ultimately improving diagnostic accuracy and patient outcomes. KEY POINTS: Radiologists must evaluate performance metrics as they reflect acceptable performance in specific datasets but do not guarantee clinical utility. Independent evaluation tailored to the clinical setting is essential. Performance metrics must align with the intended task of the AI application-segmentation, detection, or classification-and be selected based on domain knowledge and clinical context. Sensitivity, specificity, area under the ROC curve, and accuracy must be interpreted with prevalence-dependent metrics (e.g., precision, F1 score, and Matthew's correlation coefficient) calculated for the target population to ensure safe and effective clinical use.
Authors
Keywords
No keywords available for this article.