Beyond the Surface: Assessing GPT-4's Accuracy in Detecting Melanoma and Suspicious Skin Lesions From Dermoscopic Images.

Journal: Plastic surgery (Oakville, Ont.)

Abstract

Introduction: Skin self-examination has limited sensitivity for detecting skin cancer. ChatGPT-4 has image recognition capabilities that could make it a useful adjunct for cancer screening and telehealth applications. This study investigated the efficacy of ChatGPT-4 in identifying skin lesions.

Methods: Dermoscopic images were retrospectively selected from the PH2 dataset, categorized by clinical diagnosis, and uploaded to ChatGPT-4 with a predesigned prompt. Responses were compared against the clinical diagnoses. Confidence intervals were calculated using the bootstrap method to assess precision, and statistical significance was assessed using McNemar's test. Analyses were performed in Python using Jupyter Notebook.

Results: The GPT-4 model showed moderate performance in melanoma detection, with 68.5% accuracy, 52.5% sensitivity, and 72.5% specificity, differing significantly from the clinical standard (P = .002). For suspicious lesion detection, it performed better, with 68.0% accuracy, 78.0% precision, and a 70.0% F-measure, but still did not closely match the clinical diagnosis of atypical nevi and melanoma (P = .0169).

Conclusion: The statistically significant difference between ChatGPT-4's diagnoses of melanoma and suspicious lesions and those of clinical diagnosis and other AI models suggests a need for improvement in ChatGPT-4's algorithms. Limitations of this study include the use of a secondary care database with a higher melanoma incidence, the use of high-quality dermoscopic images, which limits generalizability, and a small sample size lacking diversity. Larger datasets are needed to validate the findings in broader contexts.
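
The statistical workflow named in the Methods (diagnostic metrics, bootstrap confidence intervals, and McNemar's test, run in Python) could look roughly like the sketch below. The abstract gives no implementation details, so everything here is an assumption: the function names, the percentile form of the bootstrap, the exact (binomial) form of McNemar's test, and the toy labels are illustrative only and are not the study's code or data.

```python
import math
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = lesion flagged."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def _ratio(num, den):
    # Undefined ratios (empty denominator) are reported as NaN.
    return num / den if den else float("nan")


def metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision, and F-measure."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "f_measure": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement (assumed variant)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])[metric]
        if not math.isnan(value):  # skip resamples where the metric is undefined
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


def mcnemar_exact(y_true, y_pred):
    """Two-sided exact McNemar p-value from the discordant pairs (fp and fn)."""
    _, _, fp, fn = confusion_counts(y_true, y_pred)
    n = fp + fn
    if n == 0:
        return 1.0  # no disagreements between the two ratings
    k = min(fp, fn)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy labels for illustration only (NOT the study data): 1 = melanoma.
clinical = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

results = metrics(clinical, model)
accuracy_ci = bootstrap_ci(clinical, model, "accuracy")
p_value = mcnemar_exact(clinical, model)
```

McNemar's test suits this design because each image receives two paired ratings (the clinical diagnosis and the model's response), so only the images on which the two disagree carry information about a systematic difference.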
