Informed training set design enables efficient machine learning-assisted directed protein evolution.

Journal: Cell Systems
Published Date:

Abstract

Directed evolution of proteins often involves a greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. The efficiency of such a single-step greedy walk depends on the order in which beneficial mutations are identified; that is, the process is path dependent. Here, we investigate and optimize a path-independent machine learning-assisted directed evolution (MLDE) protocol that allows in silico screening of full combinatorial libraries. In particular, we evaluate the importance of different protein encoding strategies, training procedures, models, and training set design strategies on MLDE outcome, finding the most important consideration to be the implementation of strategies that reduce inclusion of minimally informative "holes" (protein variants with zero or extremely low fitness) in training data. When applied to an epistatic, hole-filled, four-site combinatorial fitness landscape, our optimized protocol achieved the global fitness maximum up to 81-fold more frequently than single-step greedy optimization. A record of this paper's transparent peer review process is included in the supplemental information.
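
The path dependence described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the two-site landscape, the three-letter alphabet, the invented fitness values, and the function names `greedy_walk` and `rank_full_library`. The exhaustive ranking is only a stand-in for in silico screening with a trained model; it is not the authors' MLDE protocol.

```python
from itertools import permutations, product

ALPHABET = "ABC"  # hypothetical 3-letter alphabet (real libraries use 20 amino acids)

# Invented fitness values for a two-site toy landscape; not data from the paper.
FITNESS = {
    "AA": 1.0, "AB": 1.2, "AC": 3.0,
    "BA": 2.0, "BB": 2.5, "BC": 0.0,  # "BC" is a zero-fitness "hole"
    "CA": 1.5, "CB": 5.0, "CC": 3.5,  # "CB" is the global maximum
}


def greedy_walk(parent, site_order):
    """Visit sites in the given order; at each site, screen every single
    substitution of the current variant and fix the fittest one."""
    current = parent
    for site in site_order:
        candidates = [current[:site] + aa + current[site + 1:] for aa in ALPHABET]
        current = max(candidates, key=FITNESS.get)
    return current


def rank_full_library():
    """Rank every combination at once. True fitness is used here as a
    stand-in for the predictions a trained model would supply."""
    library = ["".join(combo) for combo in product(ALPHABET, repeat=2)]
    return sorted(library, key=FITNESS.get, reverse=True)


if __name__ == "__main__":
    for order in permutations(range(2)):
        end = greedy_walk("AA", order)
        print("site order", order, "ends at", end, "with fitness", FITNESS[end])
    print("best variant in full library:", rank_full_library()[0])
```

In this toy landscape the two site orders end at different variants ("BB" and "CC"), and neither walk reaches the global maximum "CB", whereas ranking the full combinatorial library does not depend on any order.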

Authors

  • Bruce J Wittmann
    Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA.
  • Yisong Yue
    Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA.
  • Frances H Arnold
    Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA.