[Evaluation of Chinese artificial intelligence large language models in oral mucosal disease consultation].
Journal:
Zhonghua kou qiang yi xue za zhi = Zhonghua kouqiang yixue zazhi = Chinese journal of stomatology
Published Date:
Mar 6, 2026
Abstract
Objective: To investigate the current use and potential of artificial intelligence (AI) large language models (LLMs) in health consultation for oral mucosal diseases.
Methods: A questionnaire survey was conducted among patients attending the Department of Oral Medicine, West China Hospital of Stomatology, Sichuan University, in November 2025 to characterize their use of AI for consultations related to oral mucosal diseases and to compare the factors influencing AI usage behavior and satisfaction. Nine standardized clinical questions concerning the etiology, symptoms, treatment, care, and prognosis of oral leukoplakia (OLK) were entered into six mainstream Chinese LLM platforms. Ten oral medicine specialists quantitatively scored the responses for accuracy, clarity, relevance, completeness, and practicality using the Quality Analysis of Medical Artificial Intelligence (QAMAI) tool. The readability of the responses was also assessed using the Alpha Readability Chinese (ARC) tool.
Results: A total of 200 patients with oral mucosal diseases were included. Only 37.5% (75/200) had ever used AI for related consultations. AI use was significantly associated with younger age and higher education level (P<0.001). Only 40.0% (30/75) of users were relatively satisfied with current AI consultations, and only 21.3% (16/75) would adopt AI's treatment or care suggestions. However, 96.0% (72/75) expressed willingness to continue using AI for future consultations. Based on the QAMAI total scores for the nine typical OLK-related clinical questions, DeepSeek (25.4 points) and Tencent Hunyuan (25.3 points) performed best and were rated "very good quality", while the other models were rated "good quality". All models scored relatively low on the "sources and references" dimension. ARC readability analysis indicated that ByteDance Doubao had the best readability (weighted total score 0.511), while DeepSeek and Tencent Hunyuan had relatively poor readability (0.358 and 0.369, respectively).
Conclusions: Current use of, and satisfaction with, AI consultation among patients with oral mucosal diseases leave room for improvement, but willingness to use it in the future is strong. The systematic evaluation of six mainstream Chinese LLMs reveals significant disparities in professional information quality and text readability for OLK consultation, alongside a prevalent lack of reliable evidence-based support. Enhancing the overall quality of AI-generated responses is therefore crucial for realizing their value in clinical application.
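The survey percentages reported in the Results can be rechecked from the counts given alongside them. The following is a minimal Python sketch of that arithmetic only. The dictionary labels and the helper name `percent` are illustrative assumptions, not part of the study, and no data beyond the counts stated in the abstract are used.

```python
# Recompute the survey proportions from the counts stated in the abstract.
# Labels and the helper name are illustrative, not taken from the study.

def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100 * numerator / denominator, 1)

# (numerator, denominator, percentage as reported in the abstract)
reported = {
    "ever used AI for consultation": (75, 200, 37.5),
    "relatively satisfied users": (30, 75, 40.0),
    "would adopt AI suggestions": (16, 75, 21.3),
    "willing to keep using AI": (72, 75, 96.0),
}

# True where the recomputed value matches the reported one.
matches = {
    label: percent(n, d) == stated
    for label, (n, d, stated) in reported.items()
}

for label, ok in matches.items():
    n, d, stated = reported[label]
    print(f"{label}: {n}/{d} = {percent(n, d)}% (reported {stated}%, match={ok})")
```

Note that the last three proportions use the 75 AI users as the denominator, not the full sample of 200.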