Learning protein fitness models from evolutionary and assay-labeled data.

Journal: Nature Biotechnology

Published Date: Jan 17, 2022

Abstract

Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model performs best overall, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
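
The combination described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The following are assumptions made only for illustration: the one-hot encoding over a 20-letter alphabet as the "site-specific amino acid features", a log-likelihood-style score standing in for the "probability density feature", a single ridge penalty applied uniformly to all weights, the omission of an intercept, and the toy sequences, scores, and fitness values, which are invented placeholders rather than data from the paper.

```python
# Hypothetical sketch: ridge regression on one-hot site features plus one
# evolutionary density feature. All data below are invented toy values.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def one_hot(seq):
    """One indicator feature per (position, amino acid) pair."""
    feats = [0.0] * (len(seq) * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        feats[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return feats


def featurize(seq, density_score):
    """Append the single density feature to the site-specific features."""
    return one_hot(seq) + [density_score]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge weights: solve (X^T X + lam * I) w = X^T y."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    return solve(A, b)


def predict(w, x):
    """Linear prediction: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy assay-labeled variants (placeholders, not real measurements).
train_seqs = ["MKTA", "MRTA", "MKSA", "AKTA", "MKTG"]
# Placeholder scores standing in for an evolutionary density model's output.
train_density = [-3.1, -4.0, -3.6, -5.2, -4.4]
train_fitness = [1.0, 0.7, 0.9, 0.2, 0.5]

X = [featurize(s, d) for s, d in zip(train_seqs, train_density)]
w = fit_ridge(X, train_fitness, lam=1.0)

# Score an unseen variant with its own (placeholder) density score.
new_pred = predict(w, featurize("MRSA", -4.5))
print(new_pred)
```

Because the density feature enters as a single extra column, any evolutionary density model can be swapped in by changing how `train_density` is computed, which matches the abstract's remark that the approach is not tied to one density model. In practice a numerical library would replace the hand-written solver.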

Authors

Chloe Hsu

Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA. chloehsu@berkeley.edu.

Hunter Nisonoff

D. E. Shaw Research, New York, New York 10036, United States.

Clara Fannjiang

Research and Development, Monterey Bay Aquarium Research Institute, Moss Landing, CA 95039, USA. clarafy@berkeley.edu.

Jennifer Listgarten

University of California, Berkeley, Electrical Engineering and Computer Science, Berkeley, CA, USA.

Keywords

Machine Learning, Proteins

External Resources

PubMed ID: 35039677
