Evaluating the Performance and Safety of Large Language Models in Generating Type 2 Diabetes Mellitus Management Plans: A Comparative Study With Physicians Using Real Patient Records.
Journal:
Cureus
Published Date:
Mar 17, 2025
Abstract
Background
The integration of large language models (LLMs) such as GPT-4 into healthcare presents both potential benefits and challenges. While LLMs show promise in applications ranging from scientific writing to personalized medicine, their practical utility and safety in clinical settings remain under scrutiny. Concerns about accuracy, ethical considerations, and bias necessitate rigorous evaluation of these technologies against established medical standards.

Methods
This study involved a comparative analysis using anonymized patient records from a healthcare setting in the state of West Bengal, India. Management plans for 50 patients with type 2 diabetes mellitus were generated by GPT-4 and by three physicians, who were blinded to each other's responses. These plans were evaluated against a reference management plan based on American Diabetes Association guidelines. Completeness, necessity, and dosage accuracy were quantified, and a Prescribing Error Score was devised to assess the quality of the generated management plans. The safety of the plans generated by GPT-4 was also assessed.

Results
Physicians' management plans had fewer missing medications than those generated by GPT-4 (p=0.008), whereas GPT-4-generated plans included fewer unnecessary medications (p=0.003). No significant difference was observed in the accuracy of drug dosages (p=0.975), and overall error scores were comparable between physicians and GPT-4 (p=0.301). Safety issues were noted in 16% of the GPT-4-generated plans, highlighting potential risks associated with AI-generated management plans.

Conclusion
The study demonstrates that while GPT-4 can effectively reduce unnecessary drug prescriptions, it does not yet match physicians in terms of plan completeness. The findings support the use of LLMs as supplementary tools in healthcare and highlight the need for improved algorithms and continuous human oversight to ensure the efficacy and safety of artificial intelligence in clinical settings.
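The abstract does not publish the exact formula behind the Prescribing Error Score. The Python sketch below is purely illustrative: the three component checks (missing, unnecessary, and mis-dosed medications) are taken from the abstract, while the function name, the dose representation, and the equal one-point weighting per error are assumptions of this sketch, not the authors' method.

    # Hypothetical sketch of a Prescribing Error Score; component definitions
    # follow the abstract, but the scoring and weighting are illustrative.

    def prescribing_error_score(generated: dict[str, float],
                                reference: dict[str, float]) -> dict:
        """Compare a generated plan to a reference plan.

        Both plans map medication name -> daily dose (assumed units, e.g. mg).
        """
        missing = [d for d in reference if d not in generated]        # omitted drugs
        unnecessary = [d for d in generated if d not in reference]    # extra drugs
        dosage_errors = [
            d for d in generated
            if d in reference and generated[d] != reference[d]        # wrong dose
        ]
        return {
            "missing": missing,
            "unnecessary": unnecessary,
            "dosage_errors": dosage_errors,
            # Assumed aggregate: one point per error, equally weighted.
            "error_score": len(missing) + len(unnecessary) + len(dosage_errors),
        }

    if __name__ == "__main__":
        reference = {"metformin": 1000, "empagliflozin": 10, "atorvastatin": 20}
        generated = {"metformin": 500, "glimepiride": 2}
        print(prescribing_error_score(generated, reference))
        # -> 2 missing, 1 unnecessary, 1 dosage error; error_score = 4

In the actual study, a lower count of missing medications would favor the physicians' plans and a lower count of unnecessary medications would favor GPT-4, consistent with the reported results.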