An Assessment of the Performance of Different Chatbots on Shoulder and Elbow Questions.

Journal: Journal of clinical medicine

Published Date: Mar 27, 2025

Abstract

Background/Objectives: The utility of artificial intelligence (AI) in medical education has recently garnered significant interest, with several studies exploring its applications across various educational domains; however, its role in orthopedic education, particularly in shoulder and elbow surgery, remains scarcely studied. This study aims to evaluate the performance of multiple AI models in answering shoulder- and elbow-related questions from the AAOS ResStudy question bank.
Methods: A total of 50 shoulder- and elbow-related questions from the AAOS ResStudy question bank were selected for the study. Questions were categorized according to anatomical location, topic, concept, and difficulty. Each question, along with the possible multiple-choice answers, was provided to each chatbot. The performance of each chatbot was recorded and analyzed to identify significant differences between the chatbots' performances across various categories.
Results: The overall average performance of all chatbots was 60.4%. There were significant differences in the performances of different chatbots (p = 0.034): GPT-4o performed best, answering 74% of the questions correctly. AAOS members outperformed all chatbots, with an average accuracy of 79.4%. There were no significant differences in performance between shoulder and elbow questions (p = 0.931). Topic-wise, chatbots did worse on questions relating to "Adhesive Capsulitis" than those relating to "Instability" (p = 0.013), "Nerve Injuries" (p = 0.002), and "Arthroplasty" (p = 0.028). Concept-wise, the best performance was seen in "Diagnosis" (71.4%), but there were no significant differences in scores between different chatbots. Difficulty analysis revealed that chatbots performed significantly better on easy questions (68.5%) compared to moderate (45.4%; p = 0.04) and hard questions (40.0%; p = 0.012).
Conclusions: AI chatbots show promise as supplementary tools in medical education and clinical decision-making, but their limitations necessitate cautious and complementary use alongside expert human judgment.

Authors

Mohamad Y Fares
Division of Shoulder and Elbow Surgery, Rothman Orthopaedic Institute, Philadelphia, PA 19107, USA.

Tarishi Parmar
Penn State College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA.

Peter Boufadel
Division of Shoulder and Elbow Surgery, Rothman Orthopaedic Institute, Philadelphia, PA 19107, USA.

Mohammad Daher
Department of Orthopedic Surgery, The Warren Alpert Medical School, Brown University, Providence, RI 02912, USA.

Jonathan Berg
Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA.

Austin Witt
Baylor University Medical Center, Dallas, TX 75246, USA.

Brian W Hill
Palm Beach Orthopaedic Institute, West Palm Beach, FL 33401, USA.

John G Horneff
Division of Shoulder and Elbow Surgery, Department of Orthopaedics, University of Pennsylvania, Philadelphia, PA 19104, USA.

Adam Z Khan
Southern Permanente Medical Group, Pasadena, CA 91188, USA.

Joseph A Abboud
Division of Shoulder and Elbow Surgery, Rothman Orthopaedic Institute, Philadelphia, PA 19107, USA.

External Resources

PubMed ID: 40217738
