Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters
Journal: arXiv
Published Date: Mar 26, 2025
Abstract
Pulmonary embolism (PE) is a leading cause of cardiovascular mortality, yet
our understanding of optimal management remains limited due to heterogeneous
and inaccessible radiology documentation. The PERT Consortium registry
standardizes PE management data but depends on resource-intensive manual
abstraction. Large language models (LLMs) offer a scalable alternative for
automating concept extraction from computed tomography PE (CTPE) reports. This
study evaluated the accuracy of LLMs in extracting PE-related concepts compared
to a human-curated criterion standard. We retrospectively analyzed MIMIC-IV and
Duke Health CTPE reports using multiple LLaMA models. Larger models (70B)
outperformed smaller ones (8B), achieving kappa values of 0.98 (PE detection),
0.65-0.75 (PE location), 0.48-0.51 (right heart strain), and 0.65-0.70 (image
artifacts). Moderate sampling temperatures (0.2-0.5) improved accuracy, while
too many in-context examples reduced performance. A dual-model review
framework achieved >80-90% precision. LLMs demonstrate strong potential for
automating PE registry abstraction, minimizing manual workload while preserving
accuracy.
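
The kappa values above measure chance-corrected agreement between model extractions and the human-curated criterion standard. The abstract does not say which kappa statistic was used; the sketch below assumes Cohen's kappa for two raters over categorical labels. The function name and the example labels are hypothetical illustrations, not study data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Assumption: the kappa reported in the abstract is Cohen's kappa;
    the source does not state this explicitly.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: fraction of reports where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for ten reports (illustrative only, not from the study).
human = ["pe", "pe", "no_pe", "no_pe", "pe", "no_pe", "pe", "no_pe", "no_pe", "no_pe"]
model = ["pe", "pe", "no_pe", "no_pe", "pe", "no_pe", "no_pe", "no_pe", "no_pe", "no_pe"]

kappa = cohens_kappa(human, model)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 9 of 10 reports (observed agreement 0.90), chance agreement is 0.54, and kappa is therefore (0.90 - 0.54) / (1 - 0.54), about 0.78. This is why a concept with skewed label frequencies, such as right heart strain, can show a modest kappa even when raw agreement looks high.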