Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents.

Journal: BMC Medical Education

Published Date: Aug 23, 2025

Abstract

BACKGROUND: Artificial intelligence (AI) has become a transformative tool in medical education and assessment. Despite advancements, AI models such as GPT-4o demonstrate variable performance on high-stakes examinations. This study compared the performance of four AI models (Llama-3, Gemini, GPT-4o, and Copilot) with specialists and residents on European General Surgery Board test questions, focusing on accuracy across question formats, lengths, and difficulty levels.

Authors

Melih Can Gül

Department of Gastrointestinal Surgery, Afyonkarahisar State Hospital, Afyonkarahisar, Türkiye. opdrmelihcangul@gmail.com.

Keywords

Adult; Artificial Intelligence; Clinical Competence; Educational Measurement; Europe; Female; General Surgery; Humans; Internship and Residency; Language; Large Language Models; Male; Specialty Boards

External Resources

PubMed ID: 40849634
