ProGen2: Exploring the boundaries of protein language models.

Journal: Cell Systems
Published Date:

Abstract

Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
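Because the abstract highlights zero-shot fitness prediction and sequence generation with the open-sourced models, a minimal usage sketch may help illustrate what that workflow looks like. The sketch below assumes a ProGen2-style autoregressive checkpoint that has been made loadable through the Hugging Face transformers causal-LM interface; the checkpoint path, prompt handling, and sampling parameters are illustrative placeholders, not the authors' exact setup (see the official release at https://github.com/salesforce/progen for the supported loading code).

    # Sketch: zero-shot sequence scoring and sampling with an autoregressive protein LM.
    # Assumption: a ProGen2-style checkpoint usable via the transformers causal-LM API.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    MODEL_PATH = "path/to/progen2-checkpoint"  # placeholder, not an official hub ID

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
    model.eval()

    def total_log_likelihood(sequence: str) -> float:
        """Approximate total log-likelihood of a sequence; higher scores serve as a
        zero-shot fitness proxy (no fine-tuning required)."""
        ids = tokenizer(sequence, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)  # causal-LM loss = mean NLL over shifted targets
        return -out.loss.item() * (ids.shape[1] - 1)  # mean NLL -> total log-likelihood

    # Rank two variants by model likelihood (sequences are arbitrary examples).
    print(total_log_likelihood("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
    print(total_log_likelihood("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA"))

    # Sample a novel sequence from a short prompt (temperature and length illustrative).
    prompt_ids = tokenizer("M", return_tensors="pt").input_ids
    sample = model.generate(prompt_ids, do_sample=True, temperature=0.8, max_length=128)
    print(tokenizer.decode(sample[0]))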

Authors

  • Erik Nijkamp
    Salesforce Research, Palo Alto, CA, USA.
  • Jeffrey A Ruffolo
    Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD 21218, USA.
  • Eli N Weinstein
    Data Science Institute, Columbia University, New York, NY, USA.
  • Nikhil Naik
    Salesforce Research, 575 High St, Palo Alto, CA, 94301, USA. nnaik@salesforce.com.
  • Ali Madani
    Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, 208A Stanley Hall #1762, Berkeley, CA, 94720-1762, USA.