Assessing Large Language Model Utility and Limitations in Diabetes Education: A Cross-Sectional Study of Patient Interactions and Specialist Evaluations

Journal: medRxiv
Published Date:

Abstract

This study assessed the value of an AI-powered conversational agent in supporting diabetes self-management among adults with diabetic retinopathy and limited educational backgrounds. In this cross-sectional study, 51 adults with Type II diabetes and diabetic retinopathy participated in moderated Q-and-A sessions with ChatGPT. Non-English-speaking and visually impaired participants interacted through trained human support. Each question-response pair was assigned to one of seven thematic categories and independently evaluated by endocrinologists and ophthalmologists using the 3C + 2 framework (clarity, completeness, correctness, safety, recency). Inter-rater reliability was calculated with intraclass correlation coefficients (ICC) and Fleiss' kappa. The cohort generated 137 questions, and 98% of the conversational agent's answers were judged informative and empathetic. Endocrinologists awarded high mean scores for clarity (4.66/5) and completeness (4.52/5) but showed limited agreement (ICC = 0.13 and 0.27, respectively). Ophthalmologists gave lower mean scores for clarity (3.09/5) and completeness (2.94/5) yet demonstrated stronger agreement (ICC = 0.70 and 0.52, respectively). Reviewers detected occasional inaccuracies and hallucinations. Participants valued the agent for sensitive discussions but deferred to physicians for complex medical issues. An AI conversational agent can help bridge communication gaps in diabetes care by providing accurate, easy-to-understand answers for individuals facing language, literacy, or vision-related barriers. Nonetheless, hallucinations and variable specialist ratings underscore the need for continuous physician oversight and iterative refinement of AI outputs. Introducing conversational AI into resource-limited clinics could enhance patient education and engagement, provided that clinicians review and contextualise the advice to ensure safety, accuracy, and personalisation. Future development should prioritise reducing hallucinations and bolstering domain-specific reliability so the tool complements, rather than replaces, professional care.
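
For readers unfamiliar with Fleiss' kappa, one of the two agreement statistics named in the abstract, the following is a minimal Python sketch of the standard formula. It is not the authors' analysis code: the function name `fleiss_kappa`, the three-rater panel, and the rating counts are all hypothetical, and the abstract does not say how the 1-5 scores were grouped into categories.

```python
import math
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who placed item i in category j. Every item must be rated
    by the same number of raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance

    if math.isclose(p_expected, 1.0):
        # All ratings are in one category; kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical data (NOT from the study): 4 answers, 3 raters,
# categories = (low, medium, high) quality.
example = [
    [0, 0, 3],
    [0, 1, 2],
    [1, 2, 0],
    [0, 3, 0],
]
print(round(fleiss_kappa(example), 3))
```

Kappa is 1 when all raters agree on every item and close to 0 when agreement is no better than chance. Because it treats categories as unordered, it is usually reported alongside a measure such as the ICC when the ratings are ordinal scores.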

Authors

  • Ghulam Mustafa; Joshua Ong; M. Zaman Shaikh; Saima Askari; Sarwat Anjum; Mohammad Idress Adhi; Abdul Sami Memon; Muhammad Uzair Abdul Rauf; Arjumand Rizvi; Imran Iqbal; Shahla Basit; Muhammad Fahadullah Khan; Muhammad Qamar Masood

Categories