VeriFact: Verifying Facts in LLM-Generated Clinical Text with Electronic Health Records

Journal: arXiv
Published Date:

Abstract

Methods to ensure the factual accuracy of text generated by large language models (LLMs) in clinical medicine are lacking. VeriFact is an artificial intelligence system that combines retrieval-augmented generation and LLM-as-a-Judge to verify whether LLM-generated text is factually supported by a patient's medical history based on their electronic health record (EHR). To evaluate this system, we introduce VeriFact-BHC, a new dataset that decomposes Brief Hospital Course narratives from discharge summaries into a set of simple statements with clinician annotations for whether each statement is supported by the patient's EHR clinical notes. Whereas the highest agreement between clinicians was 88.5%, VeriFact achieves up to 92.7% agreement when compared to a denoised and adjudicated average human clinician ground truth, suggesting that VeriFact exceeds the average clinician's ability to fact-check text against a patient's medical record. VeriFact may accelerate the development of LLM-based EHR applications by removing current evaluation bottlenecks.
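
The agreement figures above compare per-statement labels from two sources. As a minimal sketch of that kind of comparison, assuming agreement means the share of statements that receive identical labels, the Python below computes a simple percent agreement. The function name `percent_agreement`, the label strings, and the example data are hypothetical illustrations and are not taken from VeriFact or VeriFact-BHC; the paper's exact metric and denoising procedure may differ.

```python
from typing import Sequence


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Return the percentage of statements given the same label by both raters."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both raters must label the same statements")
    if not labels_a:
        raise ValueError("At least one statement is required")
    # Count the positions where the two label sequences match.
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Hypothetical labels for four statements (illustrative only, not dataset values).
system_labels = ["supported", "supported", "not supported", "supported"]
reference_labels = ["supported", "not supported", "not supported", "supported"]

print(percent_agreement(system_labels, reference_labels))  # 75.0
```

Raw percent agreement is easy to interpret but does not correct for chance agreement, which is one reason evaluations often report it alongside a chance-corrected statistic.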

Authors

  • Philip Chung
  • Akshay Swaminathan
  • Alex J. Goodell
  • Yeasul Kim
  • S. Momsen Reincke
  • Lichy Han
  • Ben Deverett
  • Mohammad Amin Sadeghi
  • Abdel-Badih Ariss
  • Marc Ghanem
  • David Seong
  • Andrew A. Lee
  • Caitlin E. Coombes
  • Brad Bradshaw
  • Mahir A. Sufian
  • Hyo Jung Hong
  • Teresa P. Nguyen
  • Mohammad R. Rasouli
  • Komal Kamra
  • Mark A. Burbridge
  • James C. McAvoy
  • Roya Saffary
  • Stephen P. Ma
  • Dev Dash
  • James Xie
  • Ellen Y. Wang
  • Clifford A. Schmiesing
  • Nigam Shah
  • Nima Aghaeepour