Assessing the Accuracy and Reliability of ChatGPT-4 to Answer Clinical EHR Messages in Sports Medicine.

Journal: Southern Medical Journal

Abstract

OBJECTIVES: Although advancements in electronic health records (EHRs) have improved clinical productivity, digital administrative responsibilities have led to increased physician burnout. Large language models (LLMs) are a potential solution to the growing volume of tasks such as charting and responding to patient messages. Previous studies have evaluated the efficacy of LLMs such as Chat Generative Pre-Trained Transformer-4 (ChatGPT-4) on clinical knowledge-based questions. Few studies, however, have evaluated LLM responses that involve clinical decision making in sports medicine. This study aims to evaluate the efficiency and clinical accuracy of ChatGPT-4 responses to common sports medicine questions that patients ask in the EHR system.
METHODS: ChatGPT-4 was prompted with few-shot exemplars involving different sports medicine injuries to generate 80 EHR scenarios. Next, ChatGPT-4 was programmed to respond to the 80 EHR scenarios, using the programmed approaches created for this purpose, to generate LLM drafts. In stage 1, four board-certified orthopedic surgeons were asked to respond to the EHR scenarios and then to complete a survey evaluating the difficulty and urgency of each situation. In stage 2, they were asked to edit the LLM drafts so that they were clinically acceptable to send to a patient.
RESULTS: In stage 1, the assessing physicians found responding to the LLM clinical question to be trivial in 60 of 80 cases (75.00%). In 58 of 80 cases (72.50%), most physicians disagreed that the patients in the LLM drafts were experiencing a severe medical event. In stage 2, the physicians rated the LLM-assisted responses as acceptable without modification in 58 of 80 cases (72.50%). Furthermore, the physicians agreed that the unedited LLM-assisted responses had a low chance of causing harm in 75 of 80 cases (93.75%). Finally, the physicians rated the responses as generated by artificial intelligence in 65 of 80 cases (81.25%).
CONCLUSIONS: Surgeons rated the majority of the LLM responses as both clinically accurate and time-saving, with a low risk of causing harm. This finding suggests that LLMs have the potential to provide adequate responses to EHR messages within the field of sports medicine, potentially lessening physician burden and workload.
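
The percentages reported in the RESULTS can be rechecked from the stated counts. The following is a minimal sketch, not part of the study: the dictionary labels are paraphrases introduced here for illustration, and the only figures used are the counts and percentages given in the abstract, each out of 80 scenarios.

```python
# Recompute the percentages reported in the abstract from the stated counts.
# The labels below are illustrative paraphrases, not the study's survey wording.
TOTAL_CASES = 80

# label -> (count reported in the abstract, percentage reported in the abstract)
reported = {
    "stage 1: responding rated trivial": (60, 75.00),
    "stage 1: not a severe medical event": (58, 72.50),
    "stage 2: acceptable without modification": (58, 72.50),
    "stage 2: low chance of causing harm": (75, 93.75),
    "stage 2: rated as AI-generated": (65, 81.25),
}


def percentage(count, total=TOTAL_CASES):
    """Return count/total as a percentage rounded to two decimal places."""
    return round(100 * count / total, 2)


recomputed = {label: percentage(count) for label, (count, _) in reported.items()}

for label, (count, stated) in reported.items():
    print(f"{label}: {count}/{TOTAL_CASES} = {recomputed[label]:.2f}% "
          f"(stated {stated:.2f}%)")
```

Each recomputed value matches the percentage stated in the abstract, so the counts and percentages are internally consistent.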
