Mut-BPE: A Modified BPE Strategy Improves Variant Effect Prediction

Journal: bioRxiv

Abstract

Byte Pair Encoding (BPE) is widely used in genome foundation models for its ability to compress long DNA sequences into fewer tokens. However, its variable-length tokens often span multiple nucleotides, limiting the model’s sensitivity to single-nucleotide variations, an essential requirement for accurate Variant Effect Prediction (VEP). We introduce Mut-BPE, a training-free, plug-and-play tokenization strategy that augments BPE with explicit single-nucleotide resolution at variant sites. Mut-BPE preserves the efficiency of BPE while enhancing its ability to represent subtle genomic alterations. To evaluate its effectiveness, we applied Mut-BPE to DNABERT-2 and conducted extensive experiments across six diverse datasets spanning gene expression, pathogenicity, and trait-associated variants, under zero-shot and fine-tuning settings. Mut-BPE consistently outperformed conventional BPE tokenization, yielding significant improvements in both AUROC and AUPRC, particularly in imbalanced datasets. These results highlight Mut-BPE as a practical enhancement for genomic foundation models in VEP tasks.

Code availability: https://anonymous.4open.science/r/Mut-BPE-1182
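To make the idea of single-nucleotide resolution at a variant site concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the variant position is emitted as its own one-nucleotide token while the flanking sequence is tokenized as usual; the abstract does not specify the exact procedure. The vocabulary `TOY_VOCAB`, the greedy longest-match function `toy_tokenize` (a stand-in for a trained BPE tokenizer), and the function `tokenize_with_variant_site` are all hypothetical names introduced for illustration.

```python
from typing import Callable, List

# Hypothetical toy vocabulary. A real BPE vocabulary is learned from data.
TOY_VOCAB = {"A", "C", "G", "T", "AC", "GT", "TT", "ACG"}


def toy_tokenize(seq: str) -> List[str]:
    """Greedy longest-match segmentation (a stand-in for trained BPE)."""
    max_len = max(len(token) for token in TOY_VOCAB)
    tokens: List[str] = []
    i = 0
    while i < len(seq):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(seq) - i), 0, -1):
            piece = seq[i:i + size]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens


def tokenize_with_variant_site(
    seq: str,
    variant_pos: int,
    tokenize: Callable[[str], List[str]] = toy_tokenize,
) -> List[str]:
    """Assumed scheme: tokenize each flank separately and keep the
    nucleotide at variant_pos (0-based) as a single-character token."""
    if not 0 <= variant_pos < len(seq):
        raise IndexError("variant_pos is outside the sequence")
    left = tokenize(seq[:variant_pos])
    right = tokenize(seq[variant_pos + 1:])
    return left + [seq[variant_pos]] + right


ref = "ACGTTACG"
alt = "ACTTTACG"  # G>T substitution at 0-based position 2

print(toy_tokenize(ref))  # ['ACG', 'TT', 'ACG']
print(toy_tokenize(alt))  # ['AC', 'TT', 'T', 'ACG']
print(tokenize_with_variant_site(ref, 2))  # ['AC', 'G', 'TT', 'ACG']
print(tokenize_with_variant_site(alt, 2))  # ['AC', 'T', 'TT', 'ACG']
```

In this toy case, plain segmentation of the reference and alternate sequences yields token lists of different lengths with shifted boundaries, so the substitution is spread over several tokens. With the variant site isolated, the two token lists align position by position and differ in exactly one token.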

Authors

  • Yucheng Xu; Weicai Long; Yawen Lu; Xu Yang; Yanlin Zhang