Evaluation of Clinically-Focused Artificial Intelligence Chatbots for Answering Drug Information Questions.

Journal: Journal of the American College of Clinical Pharmacy : JACCP
Published Date:

Abstract

BACKGROUND: Artificial intelligence (AI) tools are increasingly promoted for clinical decision support in health care. While studies have assessed the accuracy and quality of general-purpose AI chatbot responses to clinical or drug-related questions, direct comparisons among multiple clinically-focused AI chatbots are lacking. This study compared the quality of responses to drug information (DI) questions from multiple clinically-focused AI chatbots with responses from DI pharmacists.
METHODS: Thirty DI questions previously answered by DI pharmacy faculty were submitted to four clinically-focused AI chatbots: OpenEvidence (OpenEvidence, Miami, FL), Clair (CaryHealth, Washington, DC), GlassHealth (GlassHealth, San Francisco, CA), and DougallGPT (Dougall Health, New York, NY). The general-purpose chatbot ChatGPT (OpenAI, San Francisco, CA) was also queried. Three pharmacy faculty members independently assessed the quality of the responses using the 25-point CLEAR scoring framework (completeness of content, lack of false information, evidence supporting the content, appropriateness, and relevance), a validated AI health information assessment tool. Mean total scores were categorized as "poor" content (5-11 points on the CLEAR scale), "average" content (12-18 points), or "very good" content (19-25 points). CLEAR scores were summarized with descriptive statistics, and Mann-Whitney U tests were used to compare faculty and AI-generated response scores.
RESULTS: OpenEvidence had the highest mean total CLEAR score (17.82/25), followed by ChatGPT (15.72/25); both fell in the "average" content category. Among the individual components of the CLEAR score, lack of false information consistently had the highest scores across AI chatbots, while evidence supporting the content had the lowest. Compared with faculty responses, all AI chatbots had statistically significantly lower scores.
CONCLUSION: While OpenEvidence had the highest overall CLEAR scores, few question responses from any chatbot were rated in the "very good" content category. Pharmacist expertise remains essential for ensuring answers to DI questions are of high quality.
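
The CLEAR score bands given in the Methods can be illustrated with a minimal sketch. It assumes each of the five components is rated from 1 to 5, which is consistent with the 5-25 total range but is not stated in the abstract. The function names are hypothetical, and the handling of fractional mean scores that fall between bands (for example, 18.5) is an assumption rather than the authors' documented rule.

```python
# Hypothetical sketch of CLEAR total scoring and categorization.
# Assumption: five components, each rated on an integer scale of 1 to 5.
CLEAR_COMPONENTS = (
    "completeness",
    "lack_of_false_information",
    "evidence",
    "appropriateness",
    "relevance",
)


def total_clear_score(ratings: dict) -> int:
    """Sum the five component ratings into a total between 5 and 25."""
    for name in CLEAR_COMPONENTS:
        value = ratings[name]
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return sum(ratings[name] for name in CLEAR_COMPONENTS)


def categorize(total: float) -> str:
    """Map a (possibly mean) total score to the bands named in the abstract.

    Bands from the abstract: 5-11 poor, 12-18 average, 19-25 very good.
    Assumption: a fractional mean is assigned to the band of the lower
    bound it has reached, so 18.5 counts as "average".
    """
    if not 5 <= total <= 25:
        raise ValueError(f"total must be between 5 and 25, got {total}")
    if total < 12:
        return "poor"
    if total < 19:
        return "average"
    return "very good"


if __name__ == "__main__":
    # Mean totals reported in the abstract.
    for label, score in (("OpenEvidence", 17.82), ("ChatGPT", 15.72)):
        print(label, score, categorize(score))
```

Applied to the two mean totals reported in the Results, this mapping places both in the "average" band, matching the abstract.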
