Large-Scale Modeling of Sparse Protein Kinase Activity Data.

Journal: Journal of chemical information and modeling
PMID:

Abstract

Protein kinases are a protein family that plays an important role in several complex diseases such as cancer and cardiovascular and immunological diseases. Protein kinases have conserved ATP binding sites, which, when targeted, can lead to similar activities of inhibitors against different kinases. This can be exploited to create multitarget drugs. On the other hand, selectivity (lack of similar activities) is desirable in order to avoid toxicity issues. There is a vast amount of protein kinase activity data in the public domain, which can be used in many different ways. Multitask machine learning models are expected to excel for these kinds of data sets because they can learn from implicit correlations between tasks (in this case, activities against a variety of kinases). However, multitask modeling of sparse data poses two major challenges: (i) creating a balanced train-test split without data leakage and (ii) handling missing data. In this work, we construct a protein kinase benchmark set composed of two balanced splits without data leakage, using random and dissimilarity-driven cluster-based mechanisms, respectively. This data set can be used for benchmarking and developing protein kinase activity prediction models. Overall, the performance on the dissimilarity-driven cluster-based split is lower than on the random split for all models, indicating poor generalizability of the models. Nevertheless, we show that multitask deep learning models, on this very sparse data set, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multitask) models on this benchmark set.
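
As an illustration of challenge (ii), one common way to train or score a multitask model on a sparse compound-by-kinase matrix is to compute the loss only over measured entries. The sketch below is a minimal example of that idea and is not taken from the paper: the function name `masked_mse`, the use of NaN to mark missing activities, and the toy values are all assumptions made for illustration.

```python
import numpy as np


def masked_mse(y_true, y_pred):
    """Mean squared error over observed entries only.

    Assumption: missing activities are encoded as NaN in y_true
    (rows = compounds, columns = kinase tasks).
    """
    mask = ~np.isnan(y_true)  # True where an activity was measured
    if not mask.any():
        raise ValueError("no observed labels in y_true")
    # Zero out the error at missing positions so they do not contribute.
    diff = np.where(mask, y_pred - np.nan_to_num(y_true), 0.0)
    return float((diff ** 2).sum() / mask.sum())


# Hypothetical toy data: 2 compounds x 3 kinases, half of the labels missing.
y_true = np.array([[6.5, np.nan, 7.0],
                   [np.nan, 5.0, np.nan]])
y_pred = np.array([[6.0, 9.9, 7.0],
                   [1.0, 5.5, 3.0]])

loss = masked_mse(y_true, y_pred)
```

Predictions at unmeasured positions (9.9, 1.0, and 3.0 in the toy data) have no effect on the result, which is the alternative to filling those positions with imputed values before training.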

Authors

  • Sohvi Luukkonen
    ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, A-4040 Linz, Austria.
  • Erik Meijer
    Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands.
  • Giovanni A Tricarico
    Galapagos NV, Generaal De Wittelaan L11 A3, 2800 Mechelen, Belgium.
  • Johan Hofmans
    Galapagos NV, Generaal De Wittelaan L11 A3, 2800 Mechelen, Belgium.
  • Pieter F W Stouten
    Galapagos NV, Generaal De Wittelaan L11 A3, 2800 Mechelen, Belgium.
  • Gerard J P van Westen
    Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, the Netherlands. Electronic address: gerard@lacdr.leidenuniv.nl.
  • Eelke B Lenselink
    Galapagos NV, Generaal De Wittelaan L11 A3, 2800 Mechelen, Belgium. Electronic address: bart.lenselink@glpg.com.