Large generative mRNA language foundation model for efficient coding sequence generation and design with mRNA-GPT

Journal: bioRxiv

Published Date: Jan 1, 2025

Abstract

mRNA design plays a central role in synthetic biology, nucleic acid therapeutics, and vaccine development. Although large language models are applied in many biological fields, generative language models for de novo mRNA design remain largely unexplored. Here, we introduce mRNA-GPT, a series of generative mRNA language models whose pretraining datasets, for the first time, cover all three domains of life. Based on a GPT-2 transformer architecture with 302 million parameters, we pre-trained three separate models on 19,676 bacterial, 4,688 eukaryotic, and 702 archaeal species, leveraging 80 million, 83 million, and 2 million mRNA coding sequences, respectively. Distinct clustering of mRNA coding sequence embeddings from animals, plants, and fungi in the pretrained mRNA-GPT-eukaryote indicates that the model captures organism-specific sequence features. Following unsupervised pre-training, we fine-tuned mRNA-GPT on a translation-efficiency dataset to generate high-performance mRNA sequences. Compared to the pretrained model, the fine-tuned mRNA-GPT produced mRNA sequences with significantly higher translation efficiency scores, demonstrating the ability of mRNA-GPT to capture sequence features underlying high translation efficiency. We further fine-tuned mRNA-GPT on datasets for mRNA stability and mRNA expression, where it likewise produced high-performance mRNA sequences. Our pretrained models are publicly available, enabling other researchers to adapt mRNA-GPT to specialized tasks such as tissue-specific mRNA expression or stability by fine-tuning on their own data. Together, our study demonstrates that generative mRNA language modeling is a promising approach for accelerating mRNA design across diverse biological fields.
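
To give a sense of scale for the 302 million parameters mentioned above, the sketch below counts the weights of a generic GPT-2 style decoder. The abstract does not state the number of layers, the hidden size, the vocabulary, or the context length, so the values used here (24 layers, hidden size 1024, a 70-token codon-level vocabulary, 1024 positions) are assumptions chosen only for illustration and should not be read as the authors' configuration. The function name `gpt2_param_count` is likewise hypothetical.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_positions):
    """Count parameters of a GPT-2 style decoder with tied input/output embeddings."""
    # Token embedding table plus learned position embedding table.
    embeddings = vocab_size * d_model + n_positions * d_model
    # Self-attention: fused Q/K/V projection and output projection, each with bias.
    attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward block: expand to 4 * d_model, then project back, each with bias.
    mlp = (4 * d_model * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    per_block = attention + mlp + layer_norms
    # One final layer norm after the last block.
    final_layer_norm = 2 * d_model
    return embeddings + n_layer * per_block + final_layer_norm


# Sanity check against the standard GPT-2 small configuration.
gpt2_small = gpt2_param_count(n_layer=12, d_model=768, vocab_size=50257, n_positions=1024)

# Assumed, illustrative configuration (NOT taken from the abstract).
assumed_mrna_model = gpt2_param_count(n_layer=24, d_model=1024, vocab_size=70, n_positions=1024)

print(f"GPT-2 small: {gpt2_small:,}")
print(f"Assumed mRNA-scale config: {assumed_mrna_model:,}")
```

Under these assumed values the total comes out at roughly 303 million, in the same range as the reported 302 million. With a small biological vocabulary the embedding tables contribute little, so almost all of the parameters sit in the transformer blocks. The match is suggestive only and does not confirm the actual hyperparameters.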

Authors

Bian Bian; Yiming Zhang; Hongmin Li; Jiuzhou Zhong; Yutaka Saito

