BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects
Journal: arXiv
Published Date: Jun 26, 2025
Abstract
Large language models (LLMs) trained on text have demonstrated remarkable results
on natural language processing (NLP) tasks. These models have been adapted to
decipher the language of DNA, where sequences of nucleotides act as "words"
that encode genomic functions. However, the genome differs fundamentally from
natural language, as it lacks clearly defined words or a consistent grammar.
Although DNA language models (DNALMs) such as DNABERT and GENA-LM have achieved
a high level of performance on genome-related biological tasks, these models do
not encode biological functions in the presence of sequence variations. To
address this problem, we pre-train foundation models that effectively integrate
sequence variations, in particular Single Nucleotide Polymorphisms (SNPs), as
they underlie important biological functions. Specifically, we use ModernBERT
to pre-train two Biomedical Foundation Models (BMFM): BMFM-DNA-REF, which is
trained on sequences of varying lengths, together with their reverse
complements, derived from the reference genome; and BMFM-DNA-SNP, which is
trained on sequences created using a novel representation scheme that encodes
sequence variations. Our findings indicate
that integrating sequence variations into DNALMs helps capture biological
function, as seen in improvements on all fine-tuning tasks. To explore the
model's practical utility, we experimented with various strategies for SNP
imputation on the promoter detection task introduced in DNABERT-2. However, we
acknowledge that the current benchmarks are limited in their ability to fully
evaluate these models. To enable more comprehensive assessment in the future
and encourage community contributions, we release our models through
HuggingFace and the code to reproduce the results at
https://github.com/BiomedSciAI/biomed-multi-omic
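
As a rough illustration of two ideas mentioned in the abstract, the sketch below pairs a reference sequence with its reverse complement and substitutes SNP alleles into a reference window. The function names `reverse_complement` and `apply_snps`, the 0-based position convention, and the toy sequence are assumptions made for this example. The plain substitution shown here is not the representation scheme used by BMFM-DNA-SNP, which the abstract does not describe.

```python
from typing import Mapping

# Complement table for the four canonical bases plus N (unknown base).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def apply_snps(reference: str, snps: Mapping[int, str]) -> str:
    """Substitute alternate alleles into a reference window.

    `snps` maps 0-based positions to a single alternate base. This is a
    hypothetical, simple substitution, not the paper's encoding scheme.
    """
    bases = list(reference.upper())
    for pos, alt in snps.items():
        if not 0 <= pos < len(bases):
            raise IndexError(f"SNP position {pos} is outside the sequence")
        if len(alt) != 1:
            raise ValueError("a SNP must be a single base")
        bases[pos] = alt.upper()
    return "".join(bases)


ref = "ATGCGTAC"                      # toy reference window
variant = apply_snps(ref, {2: "A"})   # G>A substitution at position 2

print(reverse_complement(ref))        # GTACGCAT
print(variant)                        # ATACGTAC
print(reverse_complement(variant))    # GTACGTAT
```

Training on both a sequence and its reverse complement reflects the fact that either DNA strand may carry the functional signal, so a model benefits from seeing both orientations.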