Quality and Reliability of AI Information on Dental Implant Failure: A Comparative Multi-Model Analysis.
Journal:
The Journal of Craniofacial Surgery
Published Date:
Jan 19, 2026
Abstract
OBJECTIVE: This study aimed to develop a consensus-based set of patient questions on dental implant failure and to compare the clarity, quality, accuracy, reliability, and readability of responses generated by 4 widely used AI chatbots: ChatGPT-4, DeepSeek-R1, Microsoft Copilot, and Google Gemini.
METHODS: Twenty-three expert-validated questions were derived from the EAO 2021 and ICOI Pisa Consensus reports and independently submitted to each AI model under standardized, non-personalized conditions. Responses were assessed using CLEAR criteria, mGQS, a 5-point accuracy scale, the first 8 DISCERN items, and Flesch-based readability indices. Nonparametric tests were used for intermodel comparisons.
RESULTS: AI models demonstrated significant variability in performance. Gemini achieved the highest accuracy (P<0.001), whereas ChatGPT-4 exhibited the highest reliability based on DISCERN scores. Copilot generated the most structurally fluent responses, whereas DeepSeek-R1 offered the best readability. Although CLEAR and mGQS scores were high across all systems, readability and linguistic complexity varied markedly. Accuracy, clarity, and reliability were strongly correlated, whereas readability displayed the expected inverse association with grade-level demand.
CONCLUSIONS: AI chatbots hold potential as adjunct tools for patient education on implant failure; however, their performance characteristics differ substantially. Gemini excels in accuracy, ChatGPT-4 in reliability, Copilot in fluency, and DeepSeek-R1 in readability. Model-specific guidance and continued refinement are needed to enhance the clinical usefulness and accessibility of AI-generated patient information.
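The abstract does not state which Flesch-based indices were computed. As an illustration of why a readability score and a grade-level score move in opposite directions, the following is a minimal sketch that assumes the two standard formulas, Flesch Reading Ease and Flesch-Kincaid Grade Level. The function names and the word, sentence, and syllable counts are hypothetical and are not taken from the study.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula: higher scores mean harder text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for two chatbot responses of equal length
# (illustrative only, not data from the study).
plain = dict(words=100, sentences=8, syllables=140)   # short sentences, short words
dense = dict(words=100, sentences=4, syllables=180)   # long sentences, long words

for label, counts in (("plain", plain), ("dense", dense)):
    ease = flesch_reading_ease(**counts)
    grade = flesch_kincaid_grade(**counts)
    print(f"{label}: reading ease = {ease:.1f}, grade level = {grade:.1f}")
```

Both formulas depend on the same two ratios (words per sentence and syllables per word) with opposite signs, so a response that scores higher on reading ease will generally score lower on grade level, which is the inverse association the results describe.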