CREMP: Conformer-rotamer ensembles of macrocyclic peptides for machine learning.

Journal: Scientific Data
PMID:

Abstract

Computational and machine learning approaches to model the conformational landscape of macrocyclic peptides have the potential to enable rational design and optimization. However, accurate, fast, and scalable methods for modeling macrocycle geometries remain elusive. Recent deep learning approaches have significantly accelerated protein structure prediction and the generation of small-molecule conformational ensembles, yet similar progress has not been made for macrocyclic peptides due to their unique properties. Here, we introduce CREMP, a resource generated for the rapid development and evaluation of machine learning models for macrocyclic peptides. CREMP contains 36,198 unique macrocyclic peptides and their high-quality structural ensembles generated using the Conformer-Rotamer Ensemble Sampling Tool (CREST). Altogether, this new dataset contains nearly 31.3 million unique macrocycle geometries, each annotated with energies derived from semi-empirical extended tight-binding (xTB) DFT calculations. Additionally, we include 3,258 macrocycles with reported passive permeability data to couple conformational ensembles to experiment. We anticipate that this dataset will enable the development of machine learning models that can improve peptide design and optimization for novel therapeutics.

Authors

  • Colin A Grambow
    Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
  • Hayley Weir
    Prescient Design, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA.
  • Christian N Cunningham
    Department of Peptide Therapeutics, Genentech, 1 DNA Way, South San Francisco, CA, 94080, USA.
  • Tommaso Biancalani
    Broad Institute of MIT and Harvard, Cambridge, MA, USA. tommaso.biancalani@gmail.com.
  • Kangway V Chuang
    Department of Pharmaceutical Chemistry, Department of Bioengineering and Therapeutic Sciences, Institute for Neurodegenerative Diseases and Bakar Institute for Computational Health Sciences, University of California-San Francisco, 675 Nelson Rising Lane, San Francisco, California 94158, United States.