A comparative analysis of the performance of leading large language models on the endodontics section of the dentistry specialization exam in Türkiye.

Journal: PLOS ONE

Abstract

OBJECTIVE: This study aimed to evaluate and compare the performance of eight contemporary large language models (LLMs) on the endodontics section of the Specialization Exam in Dentistry (DUS) in Türkiye, assessing their accuracy on both theoretical knowledge and simulated clinical scenarios drawn from past exams.

METHODS: Eight LLMs (Claude 4, DeepSeek V3, Gemini 2.5 Pro, ChatGPT-4o, ChatGPT-5, Grok 4, LLaMA 4, and Perplexity) were evaluated on 127 multiple-choice endodontics questions from DUS exams administered by the Student Selection and Placement Center (ÖSYM) between 2012 and 2021. The models' responses were compared against the official answer keys. Statistical analyses were performed using Pearson's chi-square and McNemar tests, with a significance level of α = 0.05.

RESULTS: Overall accuracy differed significantly among the LLMs (p < 0.001). Gemini 2.5 Pro achieved the highest accuracy (90.6%), outperforming ChatGPT-4o (61.4%) and LLaMA 4 (71.7%). On Clinical Practice Questions (CPQ), Gemini 2.5 Pro (93.9%) surpassed ChatGPT-4o (57.6%; p = 0.019). On General Knowledge and Concept Questions (GKCQ), Gemini 2.5 Pro (89.4%), Grok 4 (85.1%), and DeepSeek V3 (84.0%) exceeded ChatGPT-4o (62.8%; p < 0.001). No model showed a significant difference between its own CPQ and GKCQ performance (p > 0.05).

CONCLUSION: Contemporary LLMs demonstrate substantial competence in endodontic knowledge, with Gemini 2.5 Pro excelling in both theoretical and clinical queries. However, the significant variability in performance across models (61.4% to 90.6%) and the complexity of retrieving and resolving clinical exam queries mean that domain-specific optimization and expert oversight are needed before these models can be reliably integrated into dental education and practice.
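As a rough illustration of the two tests named in the methods, the sketch below applies a Pearson chi-square test to a 2x2 table of correct and incorrect answers for two models, and shows the McNemar statistic for paired answers to the same questions. The correct-answer counts (115 and 78 of 127) are not reported in the abstract; they are inferred here by rounding the reported accuracies of 90.6% and 61.4%. The discordant-pair counts passed to the McNemar function are purely hypothetical, since per-question results are not given. The function names are illustrative, and the resulting p-values need not match those reported, because the abstract does not state which pairwise comparisons or corrections were applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def mcnemar(b, c):
    """McNemar chi-square (no continuity correction) from discordant pair counts.

    b: questions model A answered correctly and model B incorrectly
    c: questions model B answered correctly and model A incorrectly
    """
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


TOTAL_QUESTIONS = 127

# Assumption: counts inferred by rounding the reported accuracies, not taken from the paper.
gemini_correct = round(0.906 * TOTAL_QUESTIONS)   # 115
chatgpt4o_correct = round(0.614 * TOTAL_QUESTIONS)  # 78

chi2_stat, chi2_p = chi_square_2x2(
    gemini_correct, TOTAL_QUESTIONS - gemini_correct,
    chatgpt4o_correct, TOTAL_QUESTIONS - chatgpt4o_correct,
)

# Hypothetical discordant counts, chosen only so that b - c equals the
# difference in inferred correct answers (115 - 78 = 37).
mcnemar_stat, mcnemar_p = mcnemar(b=40, c=3)

print(f"Pearson chi-square: {chi2_stat:.2f}, p = {chi2_p:.2e}")
print(f"McNemar (hypothetical pairs): {mcnemar_stat:.2f}, p = {mcnemar_p:.2e}")
```

The chi-square test treats the two models' answers as independent samples, whereas McNemar's test uses the pairing: because every model answered the same 127 questions, only the questions on which the two models disagree carry information about which one is more accurate.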
