Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents.

Journal: BMC Medical Education
Published Date:

Abstract

BACKGROUND: Artificial intelligence (AI) has become a transformative tool in medical education and assessment. Despite advancements, AI models such as GPT-4o demonstrate variable performance on high-stakes examinations. This study compared the performance of four AI models (Llama-3, Gemini, GPT-4o, and Copilot) with specialists and residents on European General Surgery Board test questions, focusing on accuracy across question formats, lengths, and difficulty levels.

Authors

  • Melih Can Gül
    Department of Gastrointestinal Surgery, Afyonkarahisar State Hospital, Afyonkarahisar, Türkiye. opdrmelihcangul@gmail.com.