BERT and LLMs-Based avGFP Brightness Prediction and Mutation Design
Journal:
arXiv
Published Date:
Jul 30, 2024
Abstract
This study uses Transformer models and large language models (LLMs), such as
GPT and Claude, to predict the brightness of Aequorea victoria green
fluorescent protein (avGFP) and to design mutants with higher brightness.
Because traditional experimental screening is slow and costly, we apply
machine learning techniques to improve research efficiency. We first read and
preprocessed a proprietary dataset of approximately 140,000 protein sequences,
including about 30,000 avGFP sequences. We then constructed and trained a
Transformer-based prediction model to screen and design new avGFP mutants that
are expected to exhibit higher brightness.
Our methodology consists of two primary stages: first, the construction of a
scoring model using BERT, and second, the screening and generation of mutants
using mutation site statistics and large language models. By analyzing the
model's predictions, we designed and screened 10 new avGFP sequences predicted
to have high brightness. This study not only demonstrates the potential of
deep learning in
protein design but also provides new perspectives and methodologies for future
research by integrating prior knowledge from large language models.
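
The abstract does not specify how the mutation site statistics are computed. As a minimal sketch of one plausible reading, the Python below counts which substitutions, relative to a reference sequence, recur among variants whose brightness score meets a threshold. The function names, the toy sequences, the scores, and the threshold are all hypothetical illustrations, not the paper's data or implementation.

```python
from collections import Counter


def find_substitutions(reference, variant):
    """Return substitutions as (position, ref_residue, new_residue), 1-indexed."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length")
    return [
        (i + 1, ref, new)
        for i, (ref, new) in enumerate(zip(reference, variant))
        if ref != new
    ]


def mutation_site_counts(reference, scored_variants, threshold):
    """Count substitutions seen in variants scoring at least `threshold`."""
    counts = Counter()
    for sequence, score in scored_variants:
        if score >= threshold:
            counts.update(find_substitutions(reference, sequence))
    return counts


# Toy placeholder data (assumption): not real avGFP sequences or brightness values.
reference = "ACDEFGH"
scored_variants = [
    ("ACDEFGH", 3.7),  # identical to the reference
    ("ACKEFGH", 3.9),  # D3K
    ("ACKEFAH", 4.0),  # D3K and G6A
    ("AWDEFGH", 1.2),  # C2W, below the threshold
]

counts = mutation_site_counts(reference, scored_variants, threshold=3.8)
for (position, ref, new), n in counts.most_common():
    print(f"{ref}{position}{new}: {n}")
```

Substitutions that recur among high-scoring variants could then serve as candidate sites when proposing new mutants, with a scoring model ranking the resulting sequences. This sketch assumes equal-length sequences, so it handles substitutions only and ignores insertions and deletions.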