MISATO: machine learning dataset of protein-ligand complexes for structure-based drug discovery.

Journal: Nature Computational Science
PMID:

Abstract

Large language models have greatly enhanced our ability to understand biology and chemistry, yet robust methods for structure-based drug discovery, quantum chemistry and structural biology remain scarce. Precise biomolecule-ligand interaction datasets are urgently needed to train such models. To address this, we present MISATO, a dataset that combines quantum mechanical properties of small molecules with associated molecular dynamics simulations of ~20,000 experimental protein-ligand complexes, together with extensive validation of the experimental data. Starting from the existing experimental structures, we used semi-empirical quantum mechanics to refine them systematically. The dataset includes a large collection of molecular dynamics traces of protein-ligand complexes in explicit water, accumulating over 170 μs. We give examples of machine learning (ML) baseline models whose accuracy improves when they employ our data. An easy entry point for ML experts is provided to enable the next generation of drug discovery artificial intelligence models.

Authors

  • Till Siebenmorgen
    Molecular Targets and Therapeutics Center, Institute of Structural Biology, Helmholtz Munich, Neuherberg, Germany.
  • Filipe Menezes
    Molecular Targets and Therapeutics Center, Institute of Structural Biology, Helmholtz Munich, Neuherberg, Germany.
  • Sabrina Benassou
    Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany.
  • Erinc Merdivan
    Helmholtz AI, Helmholtz Munich, Neuherberg, Germany.
  • Kieran Didi
    Computer Laboratory, Cambridge University, Cambridge, UK.
  • André Santos Dias Mourão
    Molecular Targets and Therapeutics Center, Institute of Structural Biology, Helmholtz Munich, Neuherberg, Germany.
  • Radosław Kitel
    Faculty of Chemistry, Jagiellonian University, Krakow, Poland.
  • Pietro Lió
    Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge, UK.
  • Stefan Kesselheim
    Jülich Supercomputing Centre, Forschungszentrum Jülich, 52428, Jülich, Germany.
  • Marie Piraud
    Department of Informatics, Technische Universität München, Munich, Germany.
  • Fabian J Theis
    Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Munich, Germany.
  • Michael Sattler
    Molecular Targets and Therapeutics Center, Institute of Structural Biology, Helmholtz Munich, Neuherberg, Germany.
  • Grzegorz M Popowicz
    Molecular Targets and Therapeutics Center, Institute of Structural Biology, Helmholtz Munich, Neuherberg, Germany. grzegorz.popowicz@helmholtz-munich.de.