A Foundation Model for the Cancer Genome
Journal: bioRxiv
Published Date: Jun 1, 2026
Abstract
Cancer is a disease of the genome, in which somatic mutations and copy-number alterations determine tumour identity, clinical behaviour, and response to therapy. Consortium-scale sequencing has profiled hundreds of thousands of tumours, yet clinical interpretation still proceeds one alteration at a time against hand-curated knowledgebases, often ignoring co-occurring alterations and the genome-wide copy-number pattern. Self-supervised foundation models pretrained on unlabelled corpora have produced transferable representations in adjacent biological domains by learning joint structure across many features, but no comparable model exists for the cancer genome. Here we present TESSERA (Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations), a foundation model for the cancer genome pretrained on somatic single-nucleotide variants and copy-number segments through masked-token reconstruction within each modality and a contrastive objective across modalities. A single representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour typing, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation that yields predictive chemotherapy-selection biomarkers in real-world cohorts. These biomarkers are interpretable: each surfaces the co-occurring alterations underlying the prediction, exposing biology that single-gene rules miss. In metastatic colorectal cancer, where the choice between FOLFOX and FOLFIRI is currently guided by toxicity rather than tumour biology, the model uncovers a candidate predictive biomarker: a three-feature rule (TP53+/KRAS+/17p-) selecting patients who derive substantially greater benefit from FOLFOX than FOLFIRI.
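The two pretraining objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: cross-entropy over masked positions for the within-modality reconstruction term, a symmetric InfoNCE loss for the cross-modality term, and the hypothetical function names and `temperature` value. It is not the authors' implementation.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def masked_reconstruction_loss(logits, targets, masked):
    """Mean cross-entropy over the masked token positions only (assumed form)."""
    losses = [-log_softmax(logits[i])[targets[i]] for i in masked]
    return sum(losses) / len(losses)

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(snv_emb, cnv_emb, temperature=0.1):
    """Symmetric InfoNCE (assumed form): tumour i's SNV embedding should be
    closest to tumour i's own copy-number embedding, and vice versa."""
    a = [_normalize(v) for v in snv_emb]
    b = [_normalize(v) for v in cnv_emb]
    n = len(a)
    total = 0.0
    for i in range(n):
        # Similarity of SNV embedding i to every copy-number embedding.
        row = [_dot(a[i], b[j]) / temperature for j in range(n)]
        # Similarity of copy-number embedding i to every SNV embedding.
        col = [_dot(a[j], b[i]) / temperature for j in range(n)]
        total += -log_softmax(row)[i] - log_softmax(col)[i]
    return total / (2 * n)

# Toy example: two tumours with 2-dimensional embeddings per modality.
snv = [[1.0, 0.0], [0.0, 1.0]]
cnv_aligned = [[1.0, 0.0], [0.0, 1.0]]
cnv_swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned = contrastive_loss(snv, cnv_aligned)
swapped = contrastive_loss(snv, cnv_swapped)

# Uniform logits over a 4-token vocabulary at one masked position.
uniform = masked_reconstruction_loss([[0.0, 0.0, 0.0, 0.0]], [2], [0])
```

In this toy setting the contrastive term is small when each tumour's two modality embeddings agree and large when they are swapped between tumours, which is the behaviour a cross-modality objective of this kind is meant to reward. How the two terms are weighted against each other is not stated in the abstract.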