A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
Integrating large language models (LLMs) into healthcare settings can improve workflow efficiency and patient care by automating tasks such as summarising consultations. However, ensuring fidelity between LLM outputs and ground-truth information is crucial, as errors can lead to miscommunication between patients and clinicians, resulting in incorrect diagnoses and treatment decisions and compromising patient safety. We introduce a clinician-in-the-loop framework with: 1) a clinically and technically informed error taxonomy to classify LLM outputs, 2) an experimental structure to comprehensively and iteratively compare outputs within our LLM document generation pipeline, 3) a clinical safety framework to assess the potential harms of errors in LLM outputs, and 4) a graphical user interface (GUI), CREOLA, in which all of the preceding steps are performed and assessed. Our clinical error metrics were derived from 18 experimental configurations of LLM-based clinical note generation, covering 49,590 transcript sentences and 12,999 clinical note sentences. Overall, we observed a 1.47% hallucination rate (44% rated ‘major’) and a 3.45% omission rate (17% rated ‘major’). By iteratively refining prompts and workflows, we reduced major errors to below previously reported human note-taking error rates, underscoring the potential of our framework to enable safer clinical documentation.
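
To make the headline figures easier to relate to one another, the following is a minimal sketch of the arithmetic behind them. It assumes that hallucinations are counted against the 12,999 clinical note sentences and omissions against the 49,590 transcript sentences; the abstract does not state this pairing explicitly. The `ErrorRate` class and its field names are hypothetical, introduced only for illustration, and every derived value inherits the rounding of the published percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRate:
    """An error rate over a pool of sentences, plus the share of errors rated 'major'."""

    total_sentences: int   # size of the sentence pool (assumed denominator)
    rate: float            # fraction of sentences carrying the error
    major_share: float     # fraction of those errors rated 'major'

    @property
    def implied_errors(self) -> float:
        """Approximate number of erroneous sentences implied by the rate."""
        return self.total_sentences * self.rate

    @property
    def major_rate(self) -> float:
        """Fraction of all sentences in the pool carrying a 'major' error."""
        return self.rate * self.major_share


# Percentages and sentence totals are taken from the abstract.
# ASSUMPTION: which total belongs to which rate is inferred, not stated.
hallucinations = ErrorRate(total_sentences=12_999, rate=0.0147, major_share=0.44)
omissions = ErrorRate(total_sentences=49_590, rate=0.0345, major_share=0.17)

for name, err in (("hallucination", hallucinations), ("omission", omissions)):
    print(
        f"{name}: about {err.implied_errors:.0f} sentences affected, "
        f"major-error rate {err.major_rate:.2%}"
    )
```

Under these assumptions, major hallucinations and major omissions each affect somewhat more than half a percent of their respective sentence pools. The implied counts are estimates back-calculated from rounded percentages, not figures reported in the abstract.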