Limits of Artificial Intelligence Models for Skin Cancer Diagnosis in Realistic Settings.
Journal:
JAMA Dermatology
Published Date:
Jun 3, 2026
Abstract
IMPORTANCE: Artificial intelligence (AI) systems for skin cancer detection perform well in controlled settings but frequently underperform in everyday clinical practice, raising critical questions about their readiness for deployment. OBJECTIVE: To compare the diagnostic accuracy of AI algorithms vs human evaluators across varying expertise levels for skin lesion diagnosis, including rare and atypical cases, in a realistic clinical context. DESIGN, SETTING, AND PARTICIPANTS: This multi-institutional diagnostic study compared diagnostic performance among AI models and physician readers with varying dermatological expertise, ranging from less than 1 year to more than 10 years of experience. A dataset of dermatological images representing everyday clinical scenarios was used and contained 1117 cases, including clinical and dermoscopic images with associated metadata. Study inclusion spanned from March 16, 2023, to August 1, 2025. EXPOSURES: Three AI algorithms: a first-generation convolutional neural network (CNN) and 2 foundation models (PanDerm unimodal and multimodal). Human readers evaluated 100 stratified, random cases drawn from the same dataset. MAIN OUTCOMES AND MEASURES: The primary outcome was reader-level multiclass diagnostic accuracy for skin lesion classification. Secondary outcomes were binary benign vs malignant sensitivity, specificity, and balanced accuracy. Performance was compared between AI algorithms and human readers stratified by experience level. RESULTS: A total of 652 physicians (median [IQR] age, 33 [29-37] years; 559 [85.7%] female) contributed to 1092 test iterations. All human readers outperformed the CNN (mean [SD] accuracy, 65.9% [10.5%] vs 56.7% [3.9%]; difference, 9.2 percentage points [pp]; 95% CI, -9.8 to 8.5 pp; P < .001). The unimodal model's accuracy exceeded that of readers with less than 3 years of experience (mean [SD] accuracy, 72.2% [3.5%] vs 68.2% [7.6%]; difference, 4.0 pp; 95% CI, 3.2-4.9 pp; P < .001).
Experts with more than 10 years of experience achieved the highest multiclass diagnostic accuracy (mean [SD], 74.2% [5.7%]), outperforming all AI models on this primary end point: 56.7% (3.9%) for the CNN, 72.2% (3.5%) for the unimodal model, and 66.3% (3.8%) for the multimodal model. CONCLUSIONS AND RELEVANCE: In this diagnostic study, a modern foundation model surpassed readers with less than 3 years of experience on accuracy of skin lesion diagnosis and matched those with 3 to 10 years of experience but remained inferior to experts with more than 10 years of experience, highlighting both the promise and current limitations of AI in dermatologic diagnosis.