Leveraging large language models for structured information extraction from pathology reports
Journal: arXiv
Published Date: Feb 14, 2025
Abstract
Background: Structured information extraction from unstructured
histopathology reports facilitates data accessibility for clinical research.
Manual extraction by experts is time-consuming and expensive, limiting
scalability. Large language models (LLMs) offer efficient automated extraction
through zero-shot prompting, requiring only natural language instructions
without labeled data or training. We evaluate LLMs' accuracy in extracting
structured information from breast cancer histopathology reports, compared to
manual extraction by a trained human annotator.
Methods: We developed the Medical Report Information Extractor, a web
application leveraging LLMs for automated extraction. We also built a gold
standard extraction dataset to evaluate the human annotator alongside five LLMs
including GPT-4o, a leading proprietary model, and the Llama 3 model family,
which allows self-hosting for data privacy. Our assessment involved 111
histopathology reports from the Breast Cancer Now (BCN) Generations Study,
extracting 51 pathology features specified in the study's data dictionary.
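To illustrate the zero-shot approach described above, the following is a minimal sketch of how a prompt could be assembled from a data dictionary and how a model's reply could be parsed. It is not the Medical Report Information Extractor's actual implementation: the feature names, descriptions, prompt wording, and the functions `build_prompt` and `parse_response` are all hypothetical, and the call to the LLM itself is omitted.

```python
import json

# Hypothetical excerpt of a data dictionary (feature name -> description).
# The study's real dictionary specifies 51 features; these two are invented.
DATA_DICTIONARY = {
    "tumour_grade": "Histological grade of the invasive tumour (1, 2 or 3)",
    "er_status": "Oestrogen receptor status (positive or negative)",
}


def build_prompt(report_text, data_dictionary):
    """Build a zero-shot prompt: instructions only, no labeled examples."""
    feature_lines = "\n".join(
        f"- {name}: {description}"
        for name, description in data_dictionary.items()
    )
    return (
        "Extract the following features from the pathology report below.\n"
        "Return a JSON object with exactly these keys; "
        "use null when a feature is not reported.\n\n"
        f"Features:\n{feature_lines}\n\n"
        f"Report:\n{report_text}\n"
    )


def parse_response(response_text, data_dictionary):
    """Parse the model's JSON reply, keeping only the expected keys."""
    data = json.loads(response_text)
    # Missing keys become None; unexpected keys are dropped.
    return {name: data.get(name) for name in data_dictionary}
```

Because the feature list lives in plain data rather than in code, changing what is extracted only requires editing the dictionary entries, which is the kind of natural-language customization the abstract describes.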
Results: Evaluation against the gold standard dataset showed that both Llama
3.1 405B (94.7% accuracy) and GPT-4o (96.1%) achieved extraction accuracy
comparable to the human annotator (95.4%; p = 0.146 and p = 0.106,
respectively). While Llama 3.1 70B (91.6%) performed below human accuracy
(p < 0.001), its reduced computational requirements make it a viable option for
self-hosting.
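As a rough illustration of how an accuracy figure like those above could be computed, the sketch below scores extracted values against a gold standard by exact match, feature by feature. The abstract does not specify the matching rule or the statistical test behind the p-values, so exact matching and the function name `extraction_accuracy` are assumptions, and the sample data are invented.

```python
def extraction_accuracy(extracted, gold):
    """Fraction of gold-standard feature values reproduced exactly.

    Both arguments map report ID -> {feature name: value}.
    Exact-match scoring is an assumption, not the paper's stated rule.
    """
    total = 0
    correct = 0
    for report_id, gold_features in gold.items():
        predicted = extracted.get(report_id, {})
        for feature, gold_value in gold_features.items():
            total += 1
            if predicted.get(feature) == gold_value:
                correct += 1
    return correct / total if total else 0.0


# Invented example: two reports, two features each.
gold = {
    "r1": {"tumour_grade": "2", "er_status": "positive"},
    "r2": {"tumour_grade": "3", "er_status": "negative"},
}
extracted = {
    "r1": {"tumour_grade": "2", "er_status": "positive"},
    "r2": {"tumour_grade": "3", "er_status": "positive"},  # one mismatch
}
accuracy = extraction_accuracy(extracted, gold)  # 3 of 4 values match
```

If every one of the 51 features were scored for each of the 111 reports, the denominator would be 111 × 51 = 5,661 values per annotator or model.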
Conclusion: We developed an open-source tool for structured information
extraction that can be customized by non-programmers using natural language.
Its modular design enables reuse for various extraction tasks, producing
standardized, structured data from unstructured text reports to facilitate
analytics through improved accessibility and interoperability.