DNAHLM -- DNA sequence and Human Language mixed large language Model

Journal: arXiv

Published Date: Oct 22, 2024

Abstract

Many DNA large language models already exist, but most are still used in traditional ways, such as extracting sequence features for classification tasks. More innovative applications of large language models, such as prompt engineering, retrieval-augmented generation (RAG), and zero-shot or few-shot prediction, remain challenging for DNA-based models. The key issue is that DNA models and human natural language models are entirely separate, while techniques like prompt engineering require natural language; this significantly limits the application of DNA large language models. This paper introduces a model pre-trained on the GPT-2 architecture with a mix of DNA sequences and English text, using a unified BPE tokenization method. We then convert classification and other downstream tasks into Alpaca-format instruction data and perform instruction fine-tuning on this pre-trained model to create a fine-tuned model capable of handling multiple tasks. The model has demonstrated its effectiveness in DNA-related zero-shot prediction and multitask applications. This research provides a promising direction for building a unified framework for DNA sequence tasks.
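
As an illustration of the conversion step described in the abstract, the sketch below turns a labeled DNA classification example into an Alpaca-style record (instruction, input, output) and renders it with the standard Stanford Alpaca prompt template. The abstract does not give the paper's instruction wording, label names, or data, so INSTRUCTION, the example sequences, and the helper names to_alpaca_record and render_prompt are all hypothetical.

```python
import json

# Standard Stanford Alpaca prompt template for records that have an input field.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

# Hypothetical instruction text; the paper's actual wording is not in the abstract.
INSTRUCTION = "Determine whether the following DNA sequence is a promoter."


def to_alpaca_record(sequence, label, instruction=INSTRUCTION):
    """Wrap one (sequence, label) pair as an Alpaca-format dictionary."""
    return {"instruction": instruction, "input": sequence, "output": label}


def render_prompt(record):
    """Build the text prompt the model sees before generating a response."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"], input=record["input"]
    )


# Made-up sequences and labels, for illustration only.
examples = [
    ("TATAAAAGGCGCGCCATGCA", "promoter"),
    ("GATTACAGATTACAGATTAC", "non-promoter"),
]

records = [to_alpaca_record(seq, label) for seq, label in examples]

# One record as it would appear in an instruction-tuning JSON file.
print(json.dumps(records[0], indent=2))

# Full training text: prompt followed by the expected answer.
print(render_prompt(records[0]) + records[0]["output"])
```

Because the label is emitted as plain text after the response marker, several tasks can share one fine-tuned model: under this framing, only the instruction string changes between tasks.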

Authors

Wang Liang

External Resources

View on arXiv (http://arxiv.org/abs/2410.16917v2)
