Can AI chatbots be reliable in dental emergencies? Quality assessment of Arabic responses to dental emergency inquiries and public attitudes toward their use.
Journal:
BMC Oral Health
Published Date:
Jul 3, 2026
Abstract
BACKGROUND: Artificial intelligence (AI) is increasingly used in health care and dentistry without sufficient monitoring of its quality and applicability. AIM: This study aimed to evaluate public attitudes toward AI chatbots in dental emergencies and to assess the quality of Arabic-language responses generated by different AI chatbots to dental emergency inquiries. METHODS: The study had two parts. Part one was a cross-sectional online survey in which 441 Saudi residents aged ≥ 18 years answered a 33-item questionnaire in Arabic; 14 of the items measured attitudes toward the use of AI chatbots in dental emergencies on a 5-point Likert scale. In part two, 50 inquiries about dental emergencies were selected from participant answers and from input by oral and maxillofacial surgeons, then presented in Arabic to five AI chatbots (ChatGPT-5.1, Google Gemini 3, Claude Sonnet 4.5, Grok 1.3.40, and DeepSeek 3.2). Responses were evaluated by two calibrated oral and maxillofacial surgeons using 5-point Likert scales for accuracy, clarity, comprehensiveness, relevance, and acceptability. RESULTS: Participants showed moderately positive attitudes (2.72-3.89/5) toward AI chatbots for dental emergencies. AI chatbots had generally high mean scores for accuracy (4.08-4.87), clarity (4.21-4.92), comprehensiveness (4.10-4.67), relevance (4.11-4.91), and acceptability (3.84-4.89). No significant differences were found among the AI chatbots, except for Grok, which scored lower than the others on multiple quality measures (all p < 0.001). Inter-rater reliability varied across chatbots, with single-measure intraclass correlation coefficient (ICC) values ranging from 0.23 to 0.60; however, exact agreement was 69.8%, and 94.5% of paired ratings differed by no more than one point. CONCLUSION: Saudi public attitudes toward AI chatbots in dental emergencies were moderate. Overall, the quality of Arabic AI chatbot responses was high, although Grok had significantly lower ratings. Human supervision remains essential, and continuous "living" evaluations are needed to track rapidly evolving chatbot performance.
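The two agreement figures reported in the results (exact agreement, and paired ratings differing by no more than one point) are simple proportions over paired ratings. The sketch below shows one way such proportions might be computed; the function name `agreement_rates` and the ratings are hypothetical illustrations, not the study's data or the authors' code.

```python
def agreement_rates(rater_a, rater_b):
    """Return (exact agreement, within-one-point agreement) as proportions.

    Both inputs are equal-length sequences of Likert ratings, one entry
    per rated response, in the same order for both raters.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")

    # Absolute difference between the two raters for each paired rating.
    diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    n = len(diffs)

    exact = sum(d == 0 for d in diffs) / n        # identical scores
    within_one = sum(d <= 1 for d in diffs) / n   # differ by at most one point
    return exact, within_one


# Hypothetical 5-point Likert ratings from two raters (made-up values).
rater_a = [5, 4, 5, 3, 5, 4, 5, 5, 2, 4]
rater_b = [5, 5, 5, 3, 4, 4, 5, 3, 2, 4]

exact, within_one = agreement_rates(rater_a, rater_b)
print(f"exact agreement: {exact:.1%}, within one point: {within_one:.1%}")
```

When most ratings cluster near the top of the scale, percentage agreement can be high even though ICC values are modest, because ICC depends on how much the ratings vary between items; this may help explain the pattern reported above.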