Beyond Annotation: Leveraging Raw RNA-seq Reads via Foundation Models for Multi-Cancer Early Detection
Journal:
medRxiv
Published Date:
Jan 1, 2025
Abstract
Early cancer detection substantially improves patient survival, yet conventional screening methods are directed at single anatomical sites and inadequately screen 45.5% of cases. Cell-free RNA (cfRNA) from blood offers a promising, non-invasive avenue for early cancer detection, reflecting real-time transcriptional activity from tumors. However, most RNA-seq pipelines focus exclusively on annotated genes, ignoring the 98% of the human genome that comprises unannotated regions, including noncoding RNAs, introns, and transposable elements, many of which are dysregulated in cancer. Here we present an annotation-free foundation model framework that learns contextual cfRNA embeddings directly from raw 150 bp sequencing reads. Pre-trained on 10 billion reads, our approximately 2.5-billion-parameter transformer model captures sequence dependencies across annotated and unannotated regions through masked nucleotide prediction and contrastive learning on overlapping fragments. Applied to multi-cancer early detection, our approach achieved high performance using plasma-based cfRNA: 89.7% AUROC for colorectal cancer, 88.6% for lung adenocarcinoma, 88.2% for esophageal squamous cell carcinoma, and 90.7% for stomach adenocarcinoma. Notably, attention analysis revealed that 30% of the most predictive features originated from unannotated regions, underscoring the diagnostic potential of the “dark transcriptome.” This approach enables scalable, reference-free liquid biopsy analysis that uncovers cancer-specific transcriptomic signals often missed by traditional, annotation-dependent pipelines.
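
As an illustration of the two pre-training objectives named in the abstract, the following minimal sketch shows how training examples for masked nucleotide prediction and for contrastive learning on overlapping fragments might be prepared from a single 150 bp read. It is not the authors' implementation: the abstract does not specify a masking rate, mask symbol, or fragment geometry, so the 15% masking rate, the `N` mask symbol, the 100 bp fragment length, the 50 bp overlap, and the function names `mask_read` and `overlapping_pair` are all assumptions made for this example.

```python
import random

NUCLEOTIDES = "ACGT"
MASK = "N"  # assumed mask symbol; the abstract does not name one


def mask_read(read, mask_rate=0.15, rng=None):
    """Hide a random subset of bases in a read.

    Returns (masked_read, targets), where targets maps each masked
    position to the original base a model would be asked to predict.
    The 15% default rate is an assumption, not a value from the paper.
    """
    if rng is None:
        rng = random.Random(0)
    n_mask = max(1, int(len(read) * mask_rate))
    positions = sorted(rng.sample(range(len(read)), n_mask))
    chars = list(read)
    targets = {}
    for pos in positions:
        targets[pos] = chars[pos]
        chars[pos] = MASK
    return "".join(chars), targets


def overlapping_pair(read, frag_len=100, overlap=50):
    """Cut two fragments from one read that share `overlap` bases.

    Such a pair could serve as a positive example for a contrastive
    objective; fragments from unrelated reads would act as negatives.
    Fragment length and overlap are assumed values.
    """
    step = frag_len - overlap
    if step <= 0 or frag_len + step > len(read):
        raise ValueError("read too short for the requested fragments")
    return read[:frag_len], read[step:step + frag_len]


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic read for demonstration only; not real cfRNA data.
    read = "".join(rng.choice(NUCLEOTIDES) for _ in range(150))
    masked, targets = mask_read(read, rng=rng)
    left, right = overlapping_pair(read)
    print(masked)
    print(len(targets), "masked positions")
    print(left[50:] == right[:50])
```

In a real pipeline the masked positions would be predicted by the transformer and the two fragments would be embedded and pulled together by a contrastive loss; both of those steps are outside the scope of this sketch.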