Sequence-Based Prioritization of Promoter Regulatory Variants in Colorectal Cancer Using a DNA Foundation Model
Journal:
bioRxiv
Published Date:
May 28, 2026
Abstract
Noncoding regulatory variants contribute to colorectal cancer (CRC) susceptibility, yet their functional interpretation remains difficult.This is mainly attributed to regulatory effects being context-dependent and most noncoding regions lack reliable genomic annotations. We have developed a computational framework that aids in prioritizing promoter-associated variants using Evo2, a large-scale autoregressive DNA foundation model. In the framework, variants were mapped to promoter regions across ~1,250 CRC-associated genes and scored using Evo2-derived delta scores, the difference in sequence probability between reference and alternate alleles. Promoter variants showed greater predicted regulatory impact than non-promoter variants (median delta = 0.015 vs. 0.002; overall mean = 0.018, SD = 0.011). Applying a distributional threshold (delta > 0.020; top ~25%) identified 287 high-impact variants across 198 CRC-associated genes. These genes were enriched in CRC-relevant pathways such as Wnt signaling, p53 signaling, and cell cycle regulation and 36.4% (72/198) overlapped known cancer genes. Independent validation showed high-impact variants were enriched at CRC GWAS loci and overlapped transcription factor binding sites (~32%) and motif-disrupting positions (~21%), supporting their functional relevance. Together, these results show that sequence-based foundation models can scalably prioritize noncoding regulatory candidates in CRC without supervised training or predefined annotations.