Assessing the ability of large language models to summarize and generate maxillofacial prosthetic treatment options.

Journal: Journal of Prosthodontics: Official Journal of the American College of Prosthodontists

Abstract

PURPOSE: To evaluate the ability of four large language models (LLMs) (OpenAI's ChatGPT-3.5, Microsoft 365 Copilot, DeepSeek-R1, and Google Gemini 2.5 Pro) to develop treatment options when presented with clinical cases published in the maxillofacial prosthodontics literature.

MATERIALS AND METHODS: Six maxillofacial case reports were submitted to the LLMs together with a prompt requesting prosthodontic treatment options from the perspective of a prosthodontist. Expert evaluators scored the relevance, clarity, depth, focus, and coherence of the responses. Statistical analyses, including descriptive statistics, two-way analysis of variance (ANOVA), post hoc Tukey tests, Pearson correlation analyses, and intraclass correlation coefficients (ICCs), were performed (α = 0.05).

RESULTS: The total mean scores of the chatbots differed significantly for relevance (p = 0.003), clarity (p = 0.006), depth (p < 0.001), focus (p < 0.001), and coherence (p < 0.001). Copilot consistently scored the lowest, and either Gemini or DeepSeek scored the highest, on all five factors. Depth (p = 0.006), focus (p = 0.024), and coherence (p = 0.013) scores assigned by senior prosthodontists were slightly higher than those assigned by junior prosthodontists. Pearson correlation analysis revealed positive correlations between the total mean scores for all five factors (p < 0.001).

CONCLUSIONS: The study demonstrates that LLMs can develop maxillofacial prosthetic treatment plans tailored to specific clinical scenarios, although the abilities of the LLMs evaluated differed significantly. Copilot scored the lowest on all factors evaluated, and Gemini and/or DeepSeek scored the highest.
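The Pearson correlation analysis named in the methods can be illustrated with a minimal Python sketch. This is not the authors' code or data: the function name `pearson_r` and the two rating lists are hypothetical, invented only to show how a correlation between two scoring factors might be computed.

```python
import math


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations of each observation from its sequence mean.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings (NOT study data): one value per case report
# for two of the five scoring factors.
relevance = [3, 4, 4, 5, 2, 5]
clarity = [2, 4, 5, 5, 3, 4]

r = pearson_r(relevance, clarity)
print(f"Pearson r between relevance and clarity: {r:.3f}")
```

A positive `r` indicates that responses rated higher on one factor tend to be rated higher on the other, which is the pattern the abstract reports across all five factors.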
