Accuracy and empathy of AI-based conversational chatbots in response to temporomandibular dysfunction related queries.
Journal:
PEC Innovation
Published Date:
Feb 23, 2026
Abstract
OBJECTIVES: To compare the accuracy and empathy of responses generated by artificial intelligence (AI)-based chatbots to commonly asked questions about temporomandibular dysfunction (TMD), and to test the performance of an automated text-based empathy detection model against the judgments of subject matter experts (SMEs).

MATERIALS AND METHODS: TMD-related questions (n = 14) were developed by a multidisciplinary panel of SMEs and categorized into five clinical domains: Diagnosis and testing, Causes and aggravating factors, Symptoms and associated issues, Treatment options, and Management and prognosis. Free-tier implementations of three AI-based chatbots, ChatGPT GPT-3.5 (CG), Claude 3.5 Sonnet (CD), and DeepSeek R1 (DS), were prompted to generate responses to these questions. The SMEs (n = 8) rated the responses for accuracy using the Accuracy of Information (AOI) index and for empathy using a 3-point scale. To complement the expert assessments, a Bidirectional Encoder Representations from Transformers (BERT)-based empathy detection model was trained on the Empathy in Textual Online Medical Exchanges (EPITOME) dataset and validated against the SME ratings.

RESULTS: DS generated the longest responses (573.6 ± 132.7 words), significantly longer than those of CG (263.4 ± 63.5) and CD (186.6 ± 25.6). DS also had the highest accuracy across all clinical domains. Overall accuracy of the responses generated by the three chatbots was high, although accuracy varied with the clinical domain of the question. Empathy assessments showed moderate reliability among the SMEs (correlation ∼0.6). The BERT model showed strong concordance with SME judgments for high-empathy responses but lower agreement for low-empathy categorizations.

CONCLUSION: AI chatbots show promise in providing accurate information regarding TMDs, but their ability to convey empathy remains limited. The observed differences in accuracy and empathy among the three chatbots are based on a limited dataset and should therefore be interpreted with caution. Current AI chatbots represent an intermediate stage of development: they demonstrate adequate technical proficiency but remain constrained in addressing the humanistic dimensions of patient care. Although empathy detection models may inform future development, significant challenges in empathetic communication persist.