BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects

Journal: arXiv

Published Date: Jun 26, 2025

Abstract

Large language models (LLMs) trained on text have demonstrated remarkable results on natural language processing (NLP) tasks. These models have been adapted to decipher the language of DNA, where sequences of nucleotides act as "words" that encode genomic functions. However, the genome differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. Although DNA language models (DNALMs) such as DNABERT and GENA-LM have achieved a high level of performance on genome-related biological tasks, these models do not encode biological functions in the presence of sequence variations. To address this problem, we pre-train foundation models that effectively integrate sequence variations, in particular Single Nucleotide Polymorphisms (SNPs), as they underlie important biological functions. Specifically, we use ModernBERT to pre-train two different Biomedical Foundation Models (BMFM): BMFM-DNA-REF, in which the model is trained with sequences of varying lengths along with their reverse complements derived from the reference genome, and BMFM-DNA-SNP, in which the model is trained with sequences created using a novel representation scheme that encodes sequence variations. Our findings indicate that integrating sequence variations into DNALMs helps capture biological functions, as seen in improvements on all fine-tuning tasks. To explore the model's practical utility, we experimented with various strategies for SNP imputation on the promoter detection task introduced in DNABERT-2. However, we acknowledge that the current benchmarks are limited in their ability to fully evaluate these models. To enable more comprehensive assessment in the future and encourage community contributions, we release our models through HuggingFace and the code to reproduce the results at https://github.com/BiomedSciAI/biomed-multi-omic
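
The abstract mentions training on reference sequences "along with their reverse complements". As a rough illustration of that idea only, the following Python sketch pairs each sequence with its reverse complement. The function names `reverse_complement` and `with_reverse_complements` are hypothetical and are not taken from the BMFM-DNA code base; the authors' actual preprocessing may differ.

```python
# Minimal sketch, assuming plain A/C/G/T/N strings as input.
# Helper names are hypothetical, not from the released BMFM-DNA code.

# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_reverse_complements(seqs):
    """Return each input sequence followed by its reverse complement."""
    augmented = []
    for seq in seqs:
        augmented.append(seq)
        augmented.append(reverse_complement(seq))
    return augmented
```

For example, `with_reverse_complements(["ATGC"])` yields the original sequence followed by `"GCAT"`, so both strands of the same locus appear in the training data.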

Authors

Hongyang Li
Sanjoy Dey
Bum Chul Kwon
Michael Danziger
Michal Rosen-Tzvi
Jianying Hu
James Kozloski
Ching-Huei Tsou
Bharath Dandala
Pablo Meyer

External Resources

View on arXiv (http://arxiv.org/abs/2507.05265v1)
