Large generative mRNA language foundation model for efficient coding sequence generation and design with mRNA-GPT
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
mRNA design plays a central role in synthetic biology, nucleic acid therapeutics, and vaccine development. Although large language models are applied in many biological fields, generative language models for de novo mRNA design remain largely unexplored. Here, we introduce mRNA-GPT, a series of generative mRNA language models whose pretraining datasets, for the first time, cover all three domains of life. Using a GPT-2 transformer architecture with 302 million parameters, we pretrained three separate models on 19,676 bacterial, 4,688 eukaryotic, and 702 archaeal species, leveraging 80 million, 83 million, and 2 million mRNA coding sequences, respectively. Distinct clustering of mRNA coding sequence embeddings from animals, plants, and fungi in the pretrained mRNA-GPT-eukaryote indicates that the model captures organism-specific sequence features. Following unsupervised pretraining, we fine-tuned mRNA-GPT on a translation-efficiency dataset to generate high-performance mRNA sequences. Compared to the pretrained model, the fine-tuned mRNA-GPT produced mRNA sequences with significantly higher translation efficiency scores, demonstrating the ability of mRNA-GPT to capture sequence features underlying high translation efficiency. We further fine-tuned mRNA-GPT on datasets for mRNA stability and mRNA expression, where it likewise produced high-performance mRNA sequences. Our pretrained models are publicly available, enabling other researchers to adapt mRNA-GPT to specialized tasks such as tissue-specific mRNA expression or stability by fine-tuning on their own data. Together, our study demonstrates that generative mRNA language modeling is a promising approach for accelerating mRNA design across diverse biological fields.