Learned protein embeddings for machine learning.

Journal: Bioinformatics (Oxford, England)
Published Date:

Abstract

MOTIVATION: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.
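
To make the sequence-to-vector step concrete, here is a minimal sketch of one common fixed encoding, one-hot encoding. It is only an illustration of what "vector representation" means, not the learned embedding the authors propose. The names `AMINO_ACIDS`, `AA_INDEX`, and `one_hot_encode`, and the example sequence "MKV", are hypothetical and do not come from the paper.

```python
# Illustrative only: a fixed one-hot encoding, NOT the learned embedding
# proposed in the paper. All names here are assumptions for the example.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Flatten a protein sequence into a vector of length 20 * len(sequence)."""
    n = len(AMINO_ACIDS)
    vector = [0.0] * (n * len(sequence))
    for position, residue in enumerate(sequence):
        if residue not in AA_INDEX:
            raise ValueError(f"unknown residue {residue!r} at position {position}")
        # Set the single slot for this residue within this position's block of 20.
        vector[position * n + AA_INDEX[residue]] = 1.0
    return vector


example = one_hot_encode("MKV")  # hypothetical 3-residue sequence
```

A one-hot vector grows by 20 entries per residue, so it is large and sparse for a full-length protein. That contrast is one way to read the abstract's point that low-dimensional embeddings can simplify downstream modeling.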

Authors

  • Kevin K Yang
    Division of Chemistry and Chemical Engineering; California Institute of Technology; Pasadena, California; United States of America.
  • Zachary Wu
    Division of Chemistry and Chemical Engineering; California Institute of Technology; Pasadena, California; United States of America.
  • Claire N Bedbrook
    Division of Biology and Biological Engineering; California Institute of Technology; Pasadena, California; United States of America.
  • Frances H Arnold
    Division of Biology and Biological Engineering; California Institute of Technology; Pasadena, California; United States of America.