Batch-Harmonized Machine Learning Framework for Cross-Cohort RNA Biomarker Discovery in Pancreatic Adenocarcinoma

Journal: bioRxiv
Published Date:

Abstract

Pancreatic ductal adenocarcinoma (PDAC) lacks reliable prognostic biomarkers. RNA-based signatures reproduce poorly because of batch effects and platform heterogeneity between microarray and RNA-seq data, which limits machine learning applications. We developed a computational pipeline that harmonizes RNA-seq data from multiple repositories using ComBat batch correction, followed by Random Forest and XGBoost classification. Restricting the analysis to RNA-seq platforms, we retained 14,137 genes common to TCGA-PAAD (n=178) and the validation cohort GSE71729 (n=357). We quantified batch correction efficacy with silhouette coefficients and trained models on survival outcomes. ComBat correction eliminated dataset-specific clustering (silhouette coefficient: 0.866 → -0.012). Random Forest achieved 64% training accuracy and identified five prognostic biomarkers: LAMC2, DKK1, ITGB6, GPRC5A, and MAL2. These genes showed consistent importance across models and biological relevance to invasion, epithelial-mesenchymal transition, and tumor suppression. The models generalized to independent validation data. We present the first open-source R pipeline optimized for RNA-seq-based, cross-cohort biomarker discovery in pancreatic cancer. Platform-matched datasets yielded greater gene coverage than multi-platform approaches, enabling robust machine learning classification. Our framework identifies five novel prognostic genes and provides a reproducible method for multi-center RNA biomarker studies. All code, processed data, and the interactive Shiny application are available at https://github.com/MarkBarsoumMarkarian/rna-harmonization-ai
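
The silhouette-based check of batch correction can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the pipeline itself is in R, and it is not the authors' code. It assumes synthetic toy data, Euclidean distance, and a simple per-batch mean-centering step as a stand-in for ComBat (which additionally applies empirical Bayes shrinkage and variance adjustment). The names mean_silhouette and center_per_batch are hypothetical. Samples are labeled by dataset of origin, so a silhouette near 1 means samples cluster by batch and a value near 0 means batches are mixed.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient of samples grouped by batch label."""
    n = len(points)
    scores = []
    for i in range(n):
        # Distances from sample i to every other sample, grouped by label.
        by_label = {}
        for j in range(n):
            if i != j:
                d = euclidean(points[i], points[j])
                by_label.setdefault(labels[j], []).append(d)
        own = by_label.pop(labels[i], [])
        if not own or not by_label:
            scores.append(0.0)  # singleton batch or single label: score 0
            continue
        a = sum(own) / len(own)  # mean distance within own batch
        b = min(sum(d) / len(d) for d in by_label.values())  # nearest other batch
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


def center_per_batch(points, labels):
    """Subtract per-batch gene means (simplified stand-in, NOT ComBat)."""
    centered = [None] * len(points)
    for batch in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == batch]
        n_genes = len(points[idx[0]])
        means = [sum(points[i][g] for i in idx) / len(idx) for g in range(n_genes)]
        for i in idx:
            centered[i] = [points[i][g] - means[g] for g in range(n_genes)]
    return centered


# Synthetic toy data (assumption): 2 batches x 20 samples x 5 genes,
# with batch "B" shifted by a constant offset to mimic a batch effect.
random.seed(0)
batch = ["A"] * 20 + ["B"] * 20
expr = [
    [random.gauss(0.0, 1.0) + (5.0 if lab == "B" else 0.0) for _ in range(5)]
    for lab in batch
]

before = mean_silhouette(expr, batch)
after = mean_silhouette(center_per_batch(expr, batch), batch)
print(f"silhouette by batch, before: {before:.3f}, after: {after:.3f}")
```

On data like this, the silhouette computed on batch labels should drop from a clearly positive value toward zero once the batch offset is removed, which is the same direction of change the abstract reports for ComBat.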

Authors

  • Mark Barsoum Markarian