Evaluating the performance of large language models in rheumatology for connective tissue diseases: DeepSeek-R1, ChatGPT-4.0, Copilot, and Gemini-2.0.

Journal: International Journal of Medical Informatics
Published Date:

Abstract

BACKGROUND: Large language models (LLMs) demonstrate significant potential in medical information provision and may serve as valuable tools for patients seeking health information. Existing research primarily focuses on individual models or general medical inquiries, with no systematic evaluation of mainstream LLMs' performance. Particularly noteworthy is the absence of cross-comparison studies involving the Chinese AI model DeepSeek-R1. This research gap may hinder the effective translation of artificial intelligence technology into clinical practice for rheumatic diseases.

OBJECTIVE: This study aims to assess the accuracy, completeness, readability, and level of detail of the responses provided by LLMs to common questions related to connective tissue disease (CTD).

METHODS: This cross-sectional study analyzed the responses to 250 common questions related to CTD, covering etiology and pathogenesis, risk factors, clinical manifestations, diagnostic criteria and differential diagnosis, treatment, prevention, and prognosis. The questions were developed collaboratively by three experienced clinicians and piloted by two rheumatology residents. Between February 18 and February 20, 2025, the questions were input as prompts into DeepSeek-R1, ChatGPT-4.0 (OpenAI), Copilot (Microsoft), and Gemini-2.0 (Google). The responses were evaluated for accuracy, completeness, readability, level of detail, and inclusion of health advice disclaimers. Two experienced clinicians conducted a double-blind evaluation using four standardized scoring tools, with the average score serving as the final result. In cases of conflict or significant discrepancies in scores for the same question, the final score for each answer was determined by majority consensus.

RESULTS: A total of 1000 responses (4000 scores) were generated, with an average accuracy score of 5.12 (0.78) and an average completeness score of 1.98 (0.56). The answers provided by the LLMs were "easy" to read, with an average Flesch Reading Ease Score (FRES) of 80.46 (7.19). The average level of detail score was 79.38 (8.14). Overall, DeepSeek-R1 and ChatGPT-4.0 performed the best, with similar scores in accuracy, completeness, readability, and level of detail. Health advice disclaimers were included in 83%-94% of the responses.

CONCLUSION: Using LLMs as tools for education and consultation in rheumatic diseases, particularly CTD, shows promising potential, but results vary across models, indicating room for further improvement. DeepSeek-R1 and ChatGPT-4.0 scored similarly and performed the best in terms of accuracy, completeness, readability, and level of detail. The study results provide a basis for decision-making regarding the integration of the Chinese AI model DeepSeek-R1 into global patient education and support systems.

LIMITATION & FUTURE DIRECTION: This study did not establish a mechanism to assess the dynamic updating ability of LLMs, and the rapid evolution of medical knowledge could affect the accuracy of model outputs. Furthermore, this study is limited to single-turn questions and does not simulate the progressive dialogue of real clinical scenarios. Future research should focus on further improving the accuracy, completeness, and readability of LLMs to better serve clinical practice and patient education.
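
The readability figure reported in the abstract is a FRES value, read here as the Flesch Reading Ease Score. The abstract does not say which tool or settings the authors used to compute it, so the following is only a minimal sketch, assuming the standard Flesch formula and its conventional interpretation bands. The function names are illustrative and not taken from the study.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch Reading Ease formula, computed from raw text counts.

    Assumption: counts are supplied by the caller; how the study tokenized
    sentences and syllables is not stated in the abstract.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_band(score: float) -> str:
    """Map a score to the conventional Flesch difficulty label."""
    if score >= 90:
        return "very easy"
    if score >= 80:
        return "easy"
    if score >= 70:
        return "fairly easy"
    if score >= 60:
        return "standard"
    if score >= 50:
        return "fairly difficult"
    if score >= 30:
        return "difficult"
    return "very difficult"


# The reported average of 80.46 falls in the 80-90 band.
print(flesch_band(80.46))  # prints "easy"
```

Under these conventional bands, the reported average of 80.46 corresponds to the "easy" label quoted in the results.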

Authors
