Islands of Signal and Transcriptomic Sequencing: A Foundation Model for Mutation and Lineage Prediction based on DNA Methylation and RNA-seq
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
DNA methylation and RNA-seq provide complementary views of oncogenic state, but their high dimensionality complicates robust modeling. We develop a pan-cancer, multiomic foundation model that jointly encodes CpG-island DNA methylation and gene expression from TCGA, TARGET, CPTAC-3, and HCMI. Probe-level methylation is aggregated into CpG-island features, and RNA-seq is reduced to high-variance genes, yielding compact inputs for modality-specific MLP encoders. A BERT-like transformer with masked reconstruction and cross-modal prediction objectives learns a shared embedding space that supports missing-modality inputs. We evaluate the learned representations in two settings that keep the encoder frozen: (i) cancer-type classification using a linear probe on the embeddings, and (ii) mutation prediction for 214 genes using a shallow MLP. The model achieves high performance for many tumor types and gene-cancer pairs without encoder fine-tuning. Pathway-level analyses show that hallmark oncogenic and immune programs appear as smooth gradients in the embedding space, indicating that the model captures biologically meaningful structure. These results demonstrate that combining CpG-island grouping with multiomic foundation pre-training yields compact, informative embeddings for mutation and lineage inference across cancers.
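The two input-reduction steps named in the abstract, grouping probe-level methylation into CpG-island features and keeping only high-variance genes, can be illustrated with a minimal sketch. The abstract does not state the aggregation function, the variance criterion, or the number of genes retained, so the choices here (a per-island mean of beta values, population variance, and a cutoff `k`) are assumptions. The function names, probe IDs, island IDs, and toy values are likewise hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pvariance


def aggregate_to_islands(probe_beta, probe_to_island):
    """Average probe-level beta values within each CpG island.

    probe_beta: dict mapping probe ID -> beta value (None if missing).
    probe_to_island: dict mapping probe ID -> island ID (hypothetical annotation).
    The mean is an assumed aggregation; the paper may use a different one.
    """
    grouped = defaultdict(list)
    for probe, beta in probe_beta.items():
        island = probe_to_island.get(probe)
        # Skip probes with no island annotation or a missing measurement.
        if island is not None and beta is not None:
            grouped[island].append(beta)
    return {island: mean(values) for island, values in grouped.items()}


def top_variance_genes(expression, k):
    """Return the k genes with the highest variance across samples.

    expression: dict mapping gene -> list of per-sample expression values.
    """
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    return ranked[:k]


# Toy data, invented for illustration only.
probe_beta = {"cg01": 0.2, "cg02": 0.4, "cg03": 0.9, "cg04": None}
probe_to_island = {
    "cg01": "island_A",
    "cg02": "island_A",
    "cg03": "island_B",
    "cg04": "island_B",
}
island_features = aggregate_to_islands(probe_beta, probe_to_island)

expression = {
    "GENE1": [1.0, 1.0, 1.0],   # constant across samples, so zero variance
    "GENE2": [0.0, 5.0, 10.0],  # most variable
    "GENE3": [2.0, 3.0, 4.0],
}
selected = top_variance_genes(expression, k=2)
```

In this toy case, two island features replace four probe measurements and the constant gene is dropped, which is the kind of compaction the abstract describes as input to the modality-specific encoders.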