PANCDetect: Early Detection of Pancreatic Cancer from Multimodal EHR data with LLM Embeddings

Journal: medRxiv
Published Date:

Abstract

Pancreatic cancer (PANC) is often diagnosed at a late stage because it lacks specific early symptoms, which contributes to one of the highest mortality rates among cancers. Imaging modalities such as MRI and CT offer high diagnostic accuracy, but their cost makes population-wide application impractical. Electronic health records (EHRs) provide a routine, easily accessible, longitudinal, and scalable data source for risk prediction, particularly for diseases without specific symptoms such as PANC. We introduce PANCDetect, a multimodal framework that leverages large language model (LLM)-derived embeddings of diagnoses, procedures, medications, and laboratory tests, and integrates these data modalities through a Transformer-based architecture. We trained the model on MarketScan (≈250M patients) and validated it externally on two additional large real-world EHR datasets: University of Michigan Precision Health (UMPH; n≈6M patients) and OneFlorida+ (n≈26M patients). We then fine-tuned the general model on UMPH EHR data. We evaluated the performance of both models using metrics including the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall-gain curve (AUPRG), and assessed the top predictive features with integrated gradients (IG). In the MarketScan cohort, PANCDetect achieved an AUROC of 0.812 and an AUPRG of 0.851 at the 6-month prediction window, and an AUROC of 0.735 and an AUPRG of 0.629 at the 60-month window, significantly outperforming CancerRiskNet. External validation on UMPH and OneFlorida+ demonstrated good generalizability, with 6-month AUROC scores of 0.711 and 0.793, respectively. Fine-tuning on UMPH with laboratory data further improved performance, reaching an AUROC of 0.927 and an AUPRG of 0.979 at 6 months. Even at the 60-month horizon, the fine-tuned PANCDetect model maintained strong performance, with an AUROC of 0.835 and an AUPRG of 0.911.
Attribution analysis highlighted type 2 diabetes, pancreatic diseases, and personal and family history of cancer as the most important risk factors. PANCDetect is a state-of-the-art method that integrates multimodal EHR data with LLM embeddings for accurate, interpretable, and generalizable early prediction of pancreatic cancer. This framework holds promise for precision screening of high-risk patients, with the potential to improve survival outcomes without increasing healthcare costs.
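
For readers less familiar with the two reported metrics, the following is a minimal illustrative sketch, not the authors' evaluation code. It computes AUROC as the probability that a randomly chosen case scores higher than a randomly chosen control, and the precision-gain and recall-gain values at a single threshold, using the standard definitions that rescale precision and recall against the positive-class prevalence. The function names `auroc` and `precision_recall_gain` and the toy labels and scores are assumptions made for illustration only.

```python
from typing import List, Tuple


def auroc(labels: List[int], scores: List[float]) -> float:
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall_gain(
    labels: List[int], scores: List[float], threshold: float
) -> Tuple[float, float]:
    """Precision gain and recall gain at one decision threshold.

    With prevalence pi, gain(v) = (v - pi) / ((1 - pi) * v), so a value
    equal to the prevalence maps to 0 and a perfect value maps to 1.
    """
    pi = sum(labels) / len(labels)  # positive-class prevalence
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        raise ValueError("gains are undefined when there are no true positives")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    precision_gain = (precision - pi) / ((1 - pi) * precision)
    recall_gain = (recall - pi) / ((1 - pi) * recall)
    return precision_gain, recall_gain


# Toy example (hypothetical data): two controls and two cases.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
toy_pg, toy_rg = precision_recall_gain(toy_labels, toy_scores, threshold=0.35)
```

AUPRG is the area under the curve traced by these (recall gain, precision gain) pairs as the threshold varies. The sketch stops at a single point because the full curve requires interpolation choices that the abstract does not specify.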

Authors

  • Zicheng Jin; Xuhui Guo; Zehua Wang; Qiang Yang; Xiaotong Yang; Xinyu Zhang; Rui Yin; Lana X. Garmire