A comparative evaluation of large language models in diagnosis and treatment planning in restorative dentistry.

Journal: BMC oral health

Abstract

OBJECTIVE: With the advancement of artificial intelligence (AI), large language models (LLMs) have become an alternative source of information in dentistry. These LLMs, which are trained on large data sets, can answer medical questions and provide references, but they raise concerns about ethics and the accuracy of the information they provide. This study aims to evaluate the accuracy of five LLMs in diagnosis and treatment planning in restorative dentistry.
METHODOLOGY: The 20 most common cases encountered in a restorative dentistry clinic were formulated into questions. The validity of the questions was assessed using the Lawshe Content Validity Index. The questions were posed to five LLMs: ChatGPT-5, DeepSeek V3.2, Claude Sonnet 4.5, Microsoft Copilot, and Google Gemini 3 Flash. Each model was asked to create a diagnosis and treatment plan for each case. The responses were rated by 42 restorative dentistry specialists on a Likert scale. Additionally, the accuracy of the references provided in the responses was evaluated by the article authors. The data were analyzed using the non-parametric Kruskal-Wallis test, and Dunn's multiple comparison test was applied where significant differences were detected.
RESULTS: Statistically significant differences were found between the models for 15 of the 20 questions (p < 0.05). A significant difference was also found in total median scores (p < 0.001), with Google Gemini 3 Flash (median: 83) and ChatGPT-5 (median: 81) achieving the highest scores. Reference quality was evaluated using a four-category framework. Claude Sonnet 4.5 and Google Gemini 3 Flash had the highest proportions of accurate and relevant references (85.7% and 84.2%, respectively), while DeepSeek V3.2 had the highest fabrication rate (55.6%).
CONCLUSIONS: Based on specialist-evaluated response quality, no model demonstrated consistent and superior performance across all clinical scenarios. LLMs appear to have potential as supplementary information resources for clinicians in restorative dentistry; however, their clinical integration, impact on patient outcomes, and real-world usability remain to be established in future research.
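The methodology states that question validity was assessed with the Lawshe Content Validity Index, but the abstract does not give the calculation. As a minimal sketch, assuming the standard Lawshe formulation, the content validity ratio for one item is CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists rating the item "essential" and N is the panel size; the index is the mean CVR over the items considered. The panel size and counts below are hypothetical illustrations, not data from the study, and the function names are chosen here for clarity.

```python
from typing import Sequence


def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's CVR for one item: (n_e - N/2) / (N/2), ranging from -1 to 1."""
    if n_panelists <= 0:
        raise ValueError("n_panelists must be positive")
    if not 0 <= n_essential <= n_panelists:
        raise ValueError("n_essential must be between 0 and n_panelists")
    half = n_panelists / 2
    return (n_essential - half) / half


def content_validity_index(cvr_values: Sequence[float]) -> float:
    """Mean CVR across the items passed in."""
    if not cvr_values:
        raise ValueError("at least one CVR value is required")
    return sum(cvr_values) / len(cvr_values)


# Hypothetical example: 10 panelists rate 4 candidate questions.
# Each entry is the number of panelists who marked that question "essential".
panel_size = 10
essential_counts = [10, 9, 8, 5]

cvrs = [content_validity_ratio(n, panel_size) for n in essential_counts]
cvi = content_validity_index(cvrs)

for count, cvr in zip(essential_counts, cvrs):
    print(f"{count}/{panel_size} essential -> CVR = {cvr:.2f}")
print(f"CVI = {cvi:.2f}")
```

An item rated essential by exactly half the panel gets a CVR of 0, and unanimous agreement gives 1. In practice each item's CVR is compared with a critical value that depends on panel size before the item is retained; that threshold step is omitted here because the abstract does not report the panel or cutoff used.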
