A Chatbot for the Management of Bipolar Disorder: Using Retrieval-Augmented Generation with an Open-Weight Large Language Model to Answer Clinical Questions Based on the CANMAT and ISBD 2018 Guidelines
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
Clinical practice guidelines support evidence-based care but are often underused because of their complexity, clinicians' time constraints, and the difficulty of navigating them. We investigated whether a conversational agent (chatbot) using an open-weight large language model (LLM) with retrieval-augmented generation (RAG) could provide guideline-consistent answers for bipolar disorder management based on the full 2018 CANMAT and ISBD guidelines, compared with a system using only the base LLM. We developed a multi-step RAG-based chatbot that retrieves relevant guideline sections and generates responses using Llama 3.3 70B. Twenty-one clinical vignettes spanning all guideline sections were created. Six expert psychiatrists generated queries and were presented with paired, unlabeled responses from two systems: one using the base Llama 3.3 70B model, the other RAG-enhanced. Experts rated the guideline consistency of each response on a three-point scale, and the ratings were analyzed using mixed-effects ordinal logistic regression. Experts evaluated 126 response pairs; in 110 of them (87.3%), the RAG response was rated at least as correct as the baseline response. The RAG system produced 80 answers (63.5%) rated fully consistent with the guidelines versus 24 (19.0%) for baseline, and only 10 answers (7.9%) with major deviation versus 48 (38.1%) for baseline. Ordinal regression showed that RAG responses were significantly more likely to receive a higher correctness rating (OR = 9.1, 95% CI 5.3–16.3, p < 0.001), an effect that was consistent across all raters. Preference ratings favored RAG answers in 78.7% of cases. Performance varied by vignette, with some errors in both retrieval and reasoning. The use of RAG with an open-weight model helped produce answers consistent with the CANMAT guidelines across vignettes that required adapting or combining guideline text, suggesting that a bipolar guideline chatbot is viable. We identified areas for improving both the results and the evaluation. Future work should explore additional retrieval strategies and LLMs, and test the system in more naturalistic settings.
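As a quick arithmetic check on the figures above, the following minimal Python sketch recomputes the reported percentages from the reported counts. It assumes that every percentage shares a denominator of 126 ratings (21 vignettes × 6 raters), which the abstract implies but does not state outright. The middle rating category is not named in the abstract, so its label here ("intermediate") is hypothetical. Its counts are derived by subtraction and are not reported values.

```python
# Sanity check of the proportions reported in the abstract.
# Assumption: each system received 21 vignettes x 6 raters = 126 ratings.
N_RATINGS = 21 * 6

# Counts taken directly from the abstract (two of the three scale points).
reported = {
    "rag": {"fully_consistent": 80, "major_deviation": 10},
    "baseline": {"fully_consistent": 24, "major_deviation": 48},
}


def pct(count, total=N_RATINGS):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


def summarize(counts, total=N_RATINGS):
    """Add the derived middle category and attach percentages.

    The "intermediate" label is hypothetical; its count is whatever
    remains after the two reported categories are subtracted.
    """
    full = dict(counts, intermediate=total - sum(counts.values()))
    return {name: (n, pct(n, total)) for name, n in full.items()}


for system, counts in reported.items():
    print(system, summarize(counts))

# RAG response rated at least as correct as baseline: 110 of 126 pairs.
print("at least as correct:", pct(110))
```

The recomputed values agree with the abstract's 63.5%, 19.0%, 7.9%, 38.1%, and 87.3%. The odds ratio cannot be reproduced this way, because it comes from a mixed-effects model fitted to the individual ratings.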