Evaluation of Closed and Open Large Language Models in Pediatric Cardiology Board Exam Performance

Journal: medRxiv

Published Date: Jan 1, 2025

Abstract

Large language models (LLMs) have gained traction in medicine, but there is limited research comparing closed- and open-source models in subspecialty contexts. This study evaluated ChatGPT-4.0o and DeepSeek-R1 on a pediatric cardiology board-style examination to quantify their accuracy and discuss clinical and educational utility. ChatGPT-4.0o and DeepSeek-R1 were used to answer 88 text-based multiple-choice questions across 11 pediatric cardiology subtopics from a Pediatric Cardiology Board Review textbook. DeepSeek-R1’s processing time per question was measured. Statistical analyses for model comparison were conducted using an unpaired two-tailed t-test, and bivariate correlations were assessed using Pearson’s r. ChatGPT-4.0o and DeepSeek-R1 achieved 70% (62/88) and 68% (60/88) accuracy, respectively (p = 0.79). Subtopic accuracy was equal in 5 of 11 chapters, with each model outperforming its counterpart in 3 of 11. DeepSeek-R1’s processing time negatively correlated with accuracy (r = –0.68, p = 0.02). ChatGPT-4.0o and DeepSeek-R1 approached the passing threshold on a pediatric cardiology board examination, with comparable accuracy and potential for open-source models to enhance clinical and educational outcomes while supporting sustainable AI development.
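The statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes each question is scored 1 (correct) or 0 (incorrect) and that the t-test is run at the question level, which the abstract does not state. It also swaps in a normal approximation for the t distribution when computing the p value, so the result need not match the reported p = 0.79. The function names are hypothetical, and only the counts 62/88 and 60/88 come from the abstract.

```python
import math
from statistics import NormalDist, mean, variance


def unpaired_t(a, b):
    """Student's unpaired t statistic (pooled variance) and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / std_err, df


def two_tailed_p_normal(t):
    """Two-tailed p value using a normal approximation (assumption: large df)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Per-question scores rebuilt from the reported counts (question order is unknown).
chatgpt_scores = [1.0] * 62 + [0.0] * 26
deepseek_scores = [1.0] * 60 + [0.0] * 28

chatgpt_accuracy = mean(chatgpt_scores)    # 62/88
deepseek_accuracy = mean(deepseek_scores)  # 60/88

t_stat, df = unpaired_t(chatgpt_scores, deepseek_scores)
p_value = two_tailed_p_normal(t_stat)

print(f"ChatGPT accuracy:  {chatgpt_accuracy:.1%}")
print(f"DeepSeek accuracy: {deepseek_accuracy:.1%}")
print(f"t = {t_stat:.3f}, df = {df}, approximate two-tailed p = {p_value:.2f}")
```

The reported correlation between processing time and accuracy (r = –0.68) could be computed with `pearson_r` from per-subtopic times and accuracies. Those per-subtopic values are not given in the abstract, so none are supplied here.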

Authors

Nino Nikolovski; Conall T. Morgan; Michael N. Gritti
