A large language model for predicting pancreatic ductal adenocarcinoma patients from blood-derived exosomal transcriptomics data

Journal: bioRxiv
Published Date:

Abstract

Traditional machine learning approaches for text or sequence classification rely on converting textual data into numerical representations. In this study, we investigate the reverse strategy, in which numerical features are transformed into sequence representations and classified using large language models (LLMs). We applied this methodology to predict pancreatic ductal adenocarcinoma (PDAC) using the expression profiles of 50 genes from 284 PDAC and 217 non-PDAC patients. Gene expression values were converted into sequence data, with each gene represented as one residue in a 50-residue protein sequence. Protein LLMs such as PeptideBERT, ProtBERT, and ESM2 were fine-tuned on a training set of these protein sequences and evaluated on an independent dataset. The best-performing model, ProtBERT, achieved an AUC of 0.962 on the independent dataset. Additionally, an alignment-based approach employing BLAST and MERCI motifs was explored, and an ensemble model combining the LLM-based and alignment-based methods was developed. Our LLM-based model outperformed traditional machine learning models. To the best of our knowledge, this is the first study demonstrating the application of LLMs for mining transcriptomic profiles of cancer patients.

Highlights

  • Identification of over- and under-expressed genes in PDAC patients
  • Conversion of numeric gene expression data to peptide sequences
  • LLM-based models for predicting PDAC patients using peptide sequences
  • Mining of transcriptomics data using ProtBERT and ESM2
  • Gene expression profile for diagnosis of PDAC patients
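
The abstract describes converting each patient's 50 gene expression values into a 50-residue sequence but does not state the mapping rule. The following is a minimal sketch of one plausible encoding, assuming per-gene quantile binning into the 20 standard amino acids. The binning scheme, the residue ordering, and the function names `fit_bin_edges` and `encode_sample` are illustrative assumptions, not the authors' method.

```python
import numpy as np

# ASSUMPTION: the 20 standard amino acids, in alphabetical order, serve as
# the discrete alphabet. The source does not specify the alphabet or order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fit_bin_edges(train_expr, n_bins=len(AMINO_ACIDS)):
    """Compute per-gene quantile bin edges from a (samples x genes) matrix.

    Returns an array of shape (n_bins - 1, n_genes) holding the interior
    quantile cut points for each gene.
    """
    interior_quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(np.asarray(train_expr, dtype=float),
                       interior_quantiles, axis=0)


def encode_sample(expr, edges):
    """Map one expression vector (one value per gene) to a residue string."""
    residues = []
    for gene_idx, value in enumerate(expr):
        # searchsorted yields an index in 0..n_bins-1 for n_bins-1 edges
        bin_idx = int(np.searchsorted(edges[:, gene_idx], value, side="right"))
        residues.append(AMINO_ACIDS[bin_idx])
    return "".join(residues)


# Synthetic stand-in data: 100 hypothetical patients x 50 genes.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 50))
edges = fit_bin_edges(train)          # fit on training data only
sequence = encode_sample(train[0], edges)
print(len(sequence), sequence)
```

Fitting the bin edges on the training set alone and reusing them for the independent set avoids leaking information between the two. ProtBERT-style tokenizers typically expect residues separated by spaces, so `" ".join(sequence)` may be needed before tokenization.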

Authors

  • Shubham Choudhury; Naman Kumar Mehta; Gajendra P. S. Raghava