KaMLs for Predicting Protein pKa Values and Ionization States: Are Trees All You Need?

Journal: Journal of Chemical Theory and Computation
PMID:

Abstract

Despite its importance in understanding biology and computer-aided drug discovery, the accurate prediction of protein ionization states remains a formidable challenge. Physics-based approaches struggle to capture the small, competing contributions in the complex protein environment, while machine learning (ML) is hampered by the scarcity of experimental data. Here, we report the development of pKa ML (KaML) models based on decision trees and graph attention networks (GAT), exploiting physicochemical understanding and a new experimental pKa database (PKAD-3) enriched with highly shifted pKa's. KaML-CBtree significantly outperforms the current state of the art in predicting pKa values and ionization states across all six titratable amino acids, notably achieving accurate predictions for deprotonated cysteines and lysines, a blind spot in previous models. The superior performance of KaMLs is achieved in part through several innovations, including the separate treatment of acids and bases, data augmentation using AlphaFold structures, and model pretraining on a theoretical pKa database. We also introduce the classification of protonation states as a metric for evaluating pKa prediction models. A meta-feature analysis suggests a possible reason why the lightweight tree model outperforms the more complex deep learning GAT. We release an end-to-end pKa predictor based on KaML-CBtree and the new PKAD-3 database, which facilitates a variety of applications and provides the foundation for further advances in protein electrostatics research.
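
To make the idea of scoring pKa predictions by protonation-state classification concrete, the following is a minimal sketch and not the authors' evaluation protocol. It assumes a reference pH of 7.0, a simple rule that a site counts as deprotonated when its pKa is below that pH, and precision/recall as the reported scores. The function names and the pKa values are hypothetical and chosen only for illustration.

```python
def deprotonated_fraction(pka, ph=7.0):
    """Henderson-Hasselbalch: fraction of a titratable site in its deprotonated form."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def is_deprotonated(pka, ph=7.0):
    """Assumed labeling rule: the site is mostly deprotonated when pKa < pH."""
    return pka < ph


def classification_scores(predicted_pkas, experimental_pkas, ph=7.0):
    """Precision and recall for the 'deprotonated' class at the given pH."""
    tp = fp = fn = 0
    for pred, expt in zip(predicted_pkas, experimental_pkas):
        pred_state = is_deprotonated(pred, ph)
        true_state = is_deprotonated(expt, ph)
        if pred_state and true_state:
            tp += 1
        elif pred_state and not true_state:
            fp += 1
        elif true_state and not pred_state:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up pKa values for four hypothetical cysteine sites (not from PKAD-3).
experimental = [6.2, 8.5, 5.9, 9.1]
predicted = [6.6, 8.0, 7.4, 9.3]

precision, recall = classification_scores(predicted, experimental)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the third site has a pKa error of 1.5 units that flips its predicted state, so recall drops even though the other errors are harmless for classification. This illustrates why a state-based metric can complement pKa error measures.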

Authors

  • Mingzhe Shen
    Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy; National Center of Excellence for Computational Drug Abuse Research; Drug Discovery Institute; and Departments of Computational Biology and Structural Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States.
  • Daniel Kortzak
    Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, United States.
  • Simon Ambrozak
    Department of Computer Science, University of Maryland College Park, College Park, Maryland 20742, United States.
  • Shubham Bhatnagar
    Department of Computer Science, University of Maryland College Park, College Park, Maryland 20742, United States.
  • Ian Buchanan
    Department of Neurological Surgery, Keck School of Medicine of University of Southern California, 1200 North State St., Suite 3300, Los Angeles, CA, 90033, USA.
  • Ruibin Liu
    School of Physics, Beijing Institute of Technology, Beijing, China. liusir@bit.edu.cn.
  • Jana Shen
    Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, United States.