Learned protein embeddings for machine learning.

Journal: Bioinformatics (Oxford, England)

Published Date: Aug 1, 2018

Abstract

MOTIVATION: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.
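To make the sequence-to-vector step concrete, here is a minimal sketch of one-hot encoding, a common baseline representation. It is not the learned embedding the authors propose, and the names `AMINO_ACIDS`, `AA_INDEX`, and `one_hot_encode` are hypothetical, chosen only for illustration.

```python
# Illustrative baseline only: one-hot encoding, NOT the paper's learned embedding.
# All names below are hypothetical and chosen for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Convert a protein sequence into a flat vector of length 20 * len(sequence)."""
    vector = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1  # raises KeyError for non-standard residues
        vector.extend(row)
    return vector


encoded = one_hot_encode("MKV")
print(len(encoded))  # 20 entries per residue
```

In this baseline the vector length grows by 20 entries per residue, so it scales with sequence length. A low-dimensional embedding of the kind the abstract describes would be a far more compact input for a downstream model.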

Authors

Kevin K Yang

Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA.

Zachary Wu

Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA.

Claire N Bedbrook

Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.

Frances H Arnold

Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.

Keywords

Amino Acid Sequence; Bacteria; Computational Biology; Eukaryota; Humans; Machine Learning; Models, Biological; Proteins; Sequence Analysis, Protein; Software

External Resources

PubMed ID: 29584811
