Large language models for extracting histopathologic diagnoses of colorectal cancer and dysplasia from electronic health records

Journal: medRxiv
Published Date:

Abstract

Accurate data resources are essential for impactful medical research, but available structured datasets are often incomplete or inaccurate. Recent advances in open-weight large language models (LLMs) enable more accurate data extraction from unstructured text in electronic health records (EHRs), but these models have not yet been thoroughly validated for challenging diagnoses such as inflammatory bowel disease (IBD)-related neoplasia. We aimed to create a validated approach using LLMs for identifying histopathologic diagnoses in pathology reports from the nationwide Veterans Health Administration database, including patients with genotype data within the Million Veteran Program (MVP) biobank. Our approach uses simple ‘yes/no’ question prompts for the following phenotypes of interest: any colorectal dysplasia, high-grade dysplasia and/or colorectal adenocarcinoma (HGD/CRC), and invasive CRC. We validated the method on these diagnostic tasks by applying the prompts to reports from patients with IBD (and, separately, to reports from patients without IBD) and calculated F1 scores as a balanced measure of accuracy. In patients with IBD in MVP, we achieved F1 scores of 96.1% (95% CI 92.5-99.4%) for identifying dysplasia, 93.7% (88.2-98.4%) for identifying HGD/CRC, and 98% (96.3-99.4%) for identifying CRC. In patients without IBD in MVP, we achieved F1 scores of 99.2% (98.2-100%) for identifying any colorectal dysplasia, 96.5% (93.0-99.2%) for identifying HGD/CRC, and 95% (92.8-97.2%) for identifying CRC using the LLM Gemma-2. LLMs provided excellent accuracy in extracting the diagnoses of interest from EHRs. Our validated methods generalized to unstructured pathology notes, even within the constraints of resource-limited computing environments. This may therefore be a promising approach for other clinical phenotypes, given the minimal human-led development required.

Extracting structured data from free-text health records, such as pathology reports, remains a significant challenge in clinical research. Traditional natural language processing methods require extensive development and are often difficult to generalize across settings, limiting their usefulness for large-scale, reproducible data extraction. This study demonstrates that relatively small (8-9 billion parameter), publicly available large language models can accurately extract cancer and dysplasia diagnoses from pathology reports without additional task-specific training or fine-tuning. By enabling accurate data extraction from clinical text, large language models offer a scalable and accessible solution for structuring clinical data, reducing the burden of algorithm development and/or manual data curation. These advancements facilitate expanded access to high-quality real-world medical data for clinical and translational research.
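
As a rough illustration of the workflow summarized in the abstract (one yes/no question per phenotype, scored with F1 against manually reviewed labels), a minimal Python sketch might look like the following. The question wording, the function names (`build_prompt`, `parse_answer`, `f1_score`, `bootstrap_ci`), and the percentile bootstrap used for the confidence interval are all assumptions made for illustration; the abstract does not state the exact prompts or how the 95% CIs were derived, and the call to the model itself is deliberately left out.

```python
import random

# Hypothetical question wording; the study's actual prompts are not given in the abstract.
QUESTIONS = {
    "dysplasia": "Does this pathology report describe any colorectal dysplasia?",
    "hgd_crc": ("Does this pathology report describe high-grade dysplasia "
                "and/or colorectal adenocarcinoma?"),
    "crc": "Does this pathology report describe invasive colorectal cancer?",
}


def build_prompt(report_text, phenotype):
    """Wrap one pathology report in a single yes/no question."""
    return (
        f"Pathology report:\n{report_text}\n\n"
        f"Question: {QUESTIONS[phenotype]}\n"
        "Answer with 'yes' or 'no' only."
    )


def parse_answer(model_output):
    """Map raw model text to True/False, or None if it does not start with yes/no."""
    words = model_output.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!'\"")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); inputs are equal-length lists of booleans."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (an assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # Resample report indices with replacement and rescore.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

In use, each report would be passed through `build_prompt`, sent to whichever locally hosted model is available, and the reply converted with `parse_answer`; replies that parse to `None` would need a policy (for example, manual review) before scoring with `f1_score`.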

Authors

  • Brian Johnson; Tyler Bath; Xinyi Huang; Mark Lamm; Ashley Earles; Hyrum Eddington; Anna M. Dornisch; Lily J. Jih; Samir Gupta; Shailja C. Shah; Kit Curtius