Edit Distance Embedding with Genomic Large Language Model

Journal: bioRxiv
Published Date:

Abstract

Edit distance is a fundamental metric in genomic sequence analysis, yet it is computationally expensive to calculate. A practical approach for large-scale sequence analysis is to map sequences into a normed space and approximate the edit distance with the distance in that space, which is far cheaper to compute. This process, known as edit distance embedding, has been extensively studied both theoretically and in practice. Recently, embedding methods based on machine learning have gained popularity; in these methods the mapping is represented as a neural network whose parameters are learned from data. However, the accuracy of these methods remains unsatisfactory, leaving much room for improvement. Recent advances in genomic language models have shown remarkable performance in various sequence analysis applications. We investigate whether improved embeddings can be achieved using DNA language models. We introduce LLMED, a model designed to produce sequence embeddings that approximate the edit distance. LLMED is trained via contrastive learning on top of a pretrained genomic large language model. Through extensive experimental comparisons, we show that LLMED surpasses leading machine learning and rule-based embedding methods in approximating the edit distance; LLMED also achieves significantly improved accuracy in a critical application, similar sequence search.
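
To make the notion of edit distance embedding concrete, the following is a minimal sketch of the general idea only, not of LLMED. It computes the exact edit distance with the standard quadratic-time dynamic program and compares it with the Euclidean distance between two embedded sequences. The k-mer count embedding used here is a deliberately simple stand-in chosen for illustration, and the function names (edit_distance, kmer_embedding, euclidean) are hypothetical; a learned model would replace kmer_embedding with a neural network.

```python
from collections import Counter
from itertools import product
import math


def edit_distance(a: str, b: str) -> int:
    """Exact edit (Levenshtein) distance via dynamic programming.

    Runs in O(len(a) * len(b)) time, which is what makes it costly
    when many long sequences must be compared.
    """
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def kmer_embedding(seq: str, k: int = 2) -> list[float]:
    """Toy embedding (an assumption for illustration, not LLMED):
    the vector of k-mer counts over the alphabet ACGT."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [float(counts["".join(p)]) for p in product("ACGT", repeat=k)]


def euclidean(u: list[float], v: list[float]) -> float:
    """Distance in the embedding space; linear in the vector dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


if __name__ == "__main__":
    x, y = "ACGTACGT", "ACGTTCGT"
    print("edit distance:     ", edit_distance(x, y))
    print("embedding distance:", euclidean(kmer_embedding(x), kmer_embedding(y)))
```

Once sequences are embedded, each comparison costs time linear in the embedding dimension rather than quadratic in sequence length. The quality of an embedding is judged by how closely the embedding distance tracks the true edit distance, and a hand-crafted embedding such as this one tracks it only loosely.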

Authors

  • Xiang Li; Ke Chen; Yijia Zhang; Mingfu Shao