Automated Extraction of Cancer Registry Data from Pathology Reports: Comparing LLM-Based and Ontology-Driven NLP Platforms

Journal: medRxiv

Published Date: Mar 23, 2026

Abstract

Cancer data standardization requires converting unstructured pathology reports into structured registry variables, a largely manual and resource-intensive task. We evaluated two automated extraction platforms: Brim Analytics, an LLM-based system that guides and orchestrates abstraction, and DeepPhe, an ontology-driven system. Using 330 pancreatic adenocarcinoma and 34 breast cancer pathology reports from Johns Hopkins Hospital, we assessed both platforms under deployment-realistic conditions. Brim Analytics achieved high accuracy across seven registry variables in pancreatic cancer (mean 96.7%), including T stage (96.4%) and histologic grade (97.0%), with a 3.0 percentage-point decline on breast cancer (mean 93.7%). DeepPhe performed comparably for N stage (96.4% pancreatic, 94.1% breast) but showed notable T stage deficits (83.6% pancreatic, 70.6% breast). Per-report processing times averaged 0.9 s (Brim, pancreatic), 4.6 s (Brim, breast), 1.1 s (DeepPhe, pancreatic), and 3.5 s (DeepPhe, breast). These results indicate that LLM-based extraction can achieve high accuracy across cancer types and support automated data workflows.
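The accuracy figures above are per-variable agreement rates between extracted and reference values. The abstract does not describe the scoring procedure, so the following is only a minimal sketch under the assumption of exact-match scoring per report. The function names, variable keys, and toy records are hypothetical and are not taken from the study.

```python
from typing import Mapping, Sequence


def variable_accuracy(
    extracted: Sequence[Mapping[str, str]],
    gold: Sequence[Mapping[str, str]],
    variable: str,
) -> float:
    """Percent of reports whose extracted value exactly matches the gold value.

    Assumption: exact string match per report; a missing extracted value
    counts as an error. The study's actual scoring rules may differ.
    """
    if len(extracted) != len(gold) or not gold:
        raise ValueError("need equal-length, non-empty report lists")
    matches = sum(
        1 for e, g in zip(extracted, gold) if e.get(variable) == g[variable]
    )
    return 100.0 * matches / len(gold)


def percentage_point_change(before: float, after: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(after - before, 1)


# Hypothetical toy records (not study data): four reports, two variables.
toy_gold = [
    {"t_stage": "T2", "n_stage": "N0"},
    {"t_stage": "T3", "n_stage": "N1"},
    {"t_stage": "T1", "n_stage": "N0"},
    {"t_stage": "T3", "n_stage": "N2"},
]
toy_extracted = [
    {"t_stage": "T2", "n_stage": "N0"},
    {"t_stage": "T2", "n_stage": "N1"},  # T stage mismatch
    {"t_stage": "T1", "n_stage": "N0"},
    {"t_stage": "T3", "n_stage": "N2"},
]

t_acc = variable_accuracy(toy_extracted, toy_gold, "t_stage")
n_acc = variable_accuracy(toy_extracted, toy_gold, "n_stage")

# Mean accuracies reported in the abstract for Brim Analytics.
brim_change = percentage_point_change(96.7, 93.7)
```

Applied to the reported means, `percentage_point_change(96.7, 93.7)` reproduces the 3.0 percentage-point decline from pancreatic to breast cancer stated in the abstract.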

Authors

McPhaul T.; Kreimeyer K.; Baris A.; Botsis T.
