Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents.
Journal:
BMC Medical Education
Published Date:
Aug 23, 2025
Abstract
BACKGROUND: Artificial intelligence (AI) has become a transformative tool in medical education and assessment. Despite advancements, AI models such as GPT-4o demonstrate variable performance on high-stakes examinations. This study compared the performance of four AI models (Llama-3, Gemini, GPT-4o, and Copilot) with that of specialists and residents on European General Surgery Board test questions, focusing on accuracy across question formats, lengths, and difficulty levels.