Evaluating Large Language Models on American Board of Anesthesiology-style Anesthesiology Questions: Accuracy, Domain Consistency, and Clinical Implications.

Journal: Journal of Cardiothoracic and Vascular Anesthesia

Abstract

Recent advances in large language models (LLMs) have led to growing interest in their potential applications in medical education and clinical practice. This study evaluated whether five widely used LLMs (ChatGPT-4, Gemini, Claude, Microsoft CoPilot, and Meta) could achieve a passing score on the American Board of Anesthesiology (ABA) BASIC Exam. Each model completed three separate sets of 200 multiple-choice questions derived from a widely used review resource, with the content distribution mirroring the ABA BASIC Exam blueprint. All five models scored significantly above the 70% passing threshold (p < 0.05), with the following mean accuracies: ChatGPT-4, 92.0%; Gemini, 89.0%; Claude, 88.3%; Microsoft CoPilot, 91.5%; and Meta, 85.8%. A one-way analysis of variance comparing the models' mean accuracy scores found no statistically significant difference among them (F = 1.88, p = 0.190). These findings suggest that current LLMs can surpass the minimum competency required for board certification, raising important questions about their future role in medical education and clinical decision-making. Performance on topics central to cardiac, thoracic, and vascular anesthesiology (such as hemodynamic management, cardiopulmonary physiology, and coagulation) was particularly robust, suggesting relevance to both fellowship-level education and complex intraoperative care. While these results highlight the capability of artificial intelligence (AI) to meet standardized medical knowledge benchmarks, their implications extend beyond examination performance. As AI continues to evolve, its integration into real-time patient care may transform anesthesiology practice by offering decision-support tools that help physicians synthesize complex clinical data. Further research is needed to explore the reliability, ethical considerations, and real-world applications of AI-driven technologies in patient care settings.
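The abstract's comparison of the five models rests on a one-way ANOVA over their per-set accuracies. As a minimal sketch of that computation in pure Python: the per-set scores below are hypothetical (the abstract reports only the three-set means, so the resulting F value will differ from the published F = 1.88), but the degrees of freedom (4 between, 10 within, for 5 models × 3 sets) match the study design.

```python
# One-way ANOVA across five models' per-set accuracies (stdlib only).
# The per-set scores below are HYPOTHETICAL illustrations chosen to
# average to the reported means (92.0, 89.0, 88.3, 91.5, 85.8); the
# published article does not list the individual set scores.

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)                          # number of groups (models)
    n_total = sum(len(g) for g in groups)    # total observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: weighted squared deviations of group means.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: deviations of observations from group means.
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical accuracy (%) on each of the three 200-question sets.
scores = {
    "ChatGPT-4":         [92.5, 91.0, 92.5],
    "Gemini":            [90.0, 88.5, 88.5],
    "Claude":            [89.0, 87.5, 88.5],
    "Microsoft CoPilot": [92.0, 90.5, 92.0],
    "Meta":              [86.5, 85.0, 86.0],
}

F, dfb, dfw = one_way_anova(list(scores.values()))
print(f"F({dfb}, {dfw}) = {F:.2f}")
```

A non-significant F (p above 0.05 on the F distribution with these degrees of freedom) is what supports the abstract's conclusion that no model outperformed the others; the F value itself depends entirely on the within-set spread, which is assumed here.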

Authors

  • Sagar Patel
    Division of Gastroenterology, Department of Medicine, University of California San Diego, La Jolla, California, USA.
  • Vinh Ngo
    Creighton University School of Medicine, Phoenix, AZ.
  • Brian Wilhelmi
    Creighton University School of Medicine, Phoenix, AZ; Barrow Neurological Institute, Department of Anesthesiology, Phoenix, AZ.
