DNAHLM -- DNA sequence and Human Language mixed large language Model
Journal: arXiv
Published Date: Oct 22, 2024
Abstract
Many DNA large language models already exist, but most are still used in
traditional ways, such as extracting sequence features for classification
tasks. More innovative applications of large language models, such as prompt
engineering, retrieval-augmented generation (RAG), and zero-shot or few-shot
prediction, remain
challenging for DNA-based models. The key issue is that DNA models and
natural language models are entirely separate, whereas techniques like
prompt engineering require natural language; this separation
significantly limits the application of DNA large language models. This paper
introduces a model built on the GPT-2 architecture and pre-trained on a mix of
DNA sequences and English text, using a unified BPE tokenization method. We then
convert classification and other downstream tasks into Alpaca-format
instruction data, and perform instruction fine-tuning on this pre-trained model
to create a fine-tuned model capable of handling multiple tasks. The model is
effective in DNA-related zero-shot prediction and multitask applications. This
research offers a promising direction for building a unified framework for
DNA sequence tasks.
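
To make the conversion step concrete, the following is a minimal sketch of how a labeled DNA classification example might be turned into an Alpaca-format instruction record and rendered as a training prompt. The record fields (instruction, input, output) and the prompt template follow the general Alpaca convention; the instruction wording, the example sequences, the labels, and the helper names `to_alpaca_record` and `render_prompt` are illustrative assumptions and are not taken from the paper.

```python
import json

# Alpaca-style prompt template for records that carry an input field.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

# Hypothetical instruction text; the paper's actual wording is not given here.
INSTRUCTION = "Determine whether the following DNA sequence is a promoter."


def to_alpaca_record(sequence: str, label: str, instruction: str) -> dict:
    """Wrap one labeled sequence as an instruction/input/output record."""
    return {"instruction": instruction, "input": sequence, "output": label}


def render_prompt(record: dict) -> str:
    """Fill the template; the model is trained to continue with the output."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"], input=record["input"]
    )


# Made-up sequences and labels, for illustration only.
examples = [
    ("TATAAAAGGCGCGCCATGCA", "promoter"),
    ("GATTACAGATTACAGATTAC", "non-promoter"),
]

records = [to_alpaca_record(seq, label, INSTRUCTION) for seq, label in examples]

# One JSON record as it could be stored in an instruction dataset.
print(json.dumps(records[0], indent=2))

# Full training text: the rendered prompt followed by the target output.
print(render_prompt(records[0]) + records[0]["output"])
```

Because the label is expressed as plain text in the output field, the same record layout can in principle cover other downstream tasks by changing only the instruction and the expected answer.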