A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Disease-associated genetic variants occur extensively in noncoding regions such as promoters, but current methods focus primarily on single-nucleotide variants (SNVs), which typically have small regulatory effect sizes. Expanding beyond single-nucleotide events is essential, and insertions and deletions (indels) are the logical next step: they are readily identifiable in population data and more likely to disrupt regulatory elements. However, existing methods struggle with indel prediction, and clinical interpretation often requires assessing complete promoter haplotypes rather than individual variants. We present LOL-EVE (Language Of Life for Evolutionary Variant Effects), a conditional autoregressive transformer trained on 13.6 million mammalian promoter sequences that enables both zero-shot indel prediction and complete promoter sequence scoring. We introduce three benchmarks for promoter indel prediction: ultra-rare variant prioritization, causal eQTL identification, and transcription factor binding site disruption analysis. LOL-EVE’s superior performance on these benchmarks demonstrates that evolutionary patterns learned from indels enable accurate assessment of broader promoter function. Application to Genomics England clinical data shows that LOL-EVE can prioritize promoter haplotypes in known developmental disorder genes, suggesting potential utility for clinical variant assessment. LOL-EVE bridges individual variant prediction and haplotype-level analysis, demonstrating how evolution-based genomic language models may assist in evaluating regulatory variants in complex genetic cases.
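The abstract does not specify how a zero-shot indel score is computed. A common convention for autoregressive sequence models, assumed here, is to compare the log-likelihood of the variant sequence with that of the reference sequence. The sketch below illustrates that idea with a toy order-k Markov model standing in for the transformer; it is not the LOL-EVE implementation, and the names `train_markov`, `sequence_log_likelihood`, `apply_indel`, and `indel_score` are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, k=2, pseudocount=1.0):
    """Fit an order-k Markov model (a toy stand-in for an autoregressive LM).

    Returns a function log_prob(context, base) giving the smoothed
    log-probability of `base` following the k-mer `context`.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1.0

    def log_prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + pseudocount * len(ALPHABET)
        return math.log((c.get(base, 0.0) + pseudocount) / total)

    return log_prob


def sequence_log_likelihood(seq, log_prob, k=2):
    """Sum next-base log-probabilities left to right (autoregressive scoring)."""
    return sum(log_prob(seq[i - k:i], seq[i]) for i in range(k, len(seq)))


def apply_indel(ref, pos, ref_allele, alt_allele):
    """Replace ref_allele at 0-based position pos with alt_allele."""
    if ref[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("ref_allele does not match the reference at pos")
    return ref[:pos] + alt_allele + ref[pos + len(ref_allele):]


def indel_score(ref, pos, ref_allele, alt_allele, log_prob, k=2):
    """Assumed zero-shot score: log P(variant sequence) - log P(reference)."""
    alt = apply_indel(ref, pos, ref_allele, alt_allele)
    return (sequence_log_likelihood(alt, log_prob, k)
            - sequence_log_likelihood(ref, log_prob, k))


if __name__ == "__main__":
    # Toy training data; real models train on millions of promoter sequences.
    model = train_markov(["GGCTATAAAGGCTATAAAGGC"] * 5, k=2)
    ref = "GGCTATAAAGGC"
    # Delete "TATA" (VCF-style: anchor base C retained).
    print(indel_score(ref, 3, "TATAA", "T", model, k=2))
```

Because the same function scores any full sequence, `sequence_log_likelihood` can also be applied to a complete promoter haplotype carrying several variants at once. Note that raw log-likelihood differences between sequences of different lengths are length-dependent, so a practical scoring scheme may need some form of normalization; how LOL-EVE handles this is not stated in the abstract.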