Edit Distance Embedding with Genomic Large Language Model

Journal: bioRxiv

Published Date: Jan 1, 2025

Abstract

Edit distance is a fundamental metric in genomic sequence analysis, yet it is computationally expensive to calculate. A practical approach for large-scale sequence analysis is to map sequences into a normed space and approximate the edit distance with the distance in that space, which is cheaper to compute. This process, known as edit distance embedding, has been extensively studied both theoretically and in practice. Recently, embedding methods based on machine learning have gained popularity; in these methods the mapping is represented as a neural network whose parameters are learned from data. However, the accuracy of these methods remains unsatisfactory, leaving much room for improvement. Recent advances in genomic language models have shown remarkable performance in various sequence analysis applications. We investigate whether improved embeddings can be achieved using DNA language models. We introduce LLMED, a model designed to produce sequence embeddings that approximate the edit distance. LLMED is trained via contrastive learning based on a pretrained genomic large language model. Through extensive experimental comparisons, we show that LLMED surpasses leading machine learning and rule-based embedding methods in approximating the edit distance; LLMED also achieves significantly improved accuracy in a critical application, similar sequence search.
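To make the idea of edit distance embedding concrete, the following is a minimal sketch in Python. It computes the exact edit distance with the standard quadratic-time dynamic program, then compares it against a distance in a vector space. The embedding used here, a k-mer count vector with the L1 norm, is a simple rule-based stand-in chosen for illustration only; it is not LLMED, and the function names `kmer_embedding` and `l1_distance` are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance; O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def kmer_embedding(seq: str, k: int = 3) -> Counter:
    """Illustrative stand-in (not LLMED): sparse vector of k-mer counts."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def l1_distance(u: Counter, v: Counter) -> int:
    """L1 distance between two sparse count vectors."""
    return sum(abs(u[key] - v[key]) for key in u.keys() | v.keys())


if __name__ == "__main__":
    s, t = "ACGTACGTAC", "ACGTTCGTAC"  # one substitution apart
    print("edit distance:", edit_distance(s, t))
    print("embedded L1 distance:", l1_distance(kmer_embedding(s), kmer_embedding(t)))
```

Once sequences are embedded, comparing two of them costs time proportional to the size of their vectors rather than the product of their lengths, which is what makes the approach attractive at scale. How closely the embedded distance tracks the true edit distance depends entirely on the mapping, and that is the quantity a learned embedding is trained to improve.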

Authors

Xiang Li; Ke Chen; Yijia Zhang; Mingfu Shao

