Machine learning analysis of molecular dynamics properties influencing drug solubility.
Journal: Scientific Reports
Published Date: Jul 24, 2025
Abstract
Solubility is critical in drug discovery and development, as it significantly influences a medication's bioavailability and therapeutic efficacy. Understanding solubility at the early stages of drug discovery is essential for minimizing resource consumption and for raising the likelihood of clinical success by prioritizing compounds with optimal solubility. Molecular dynamics (MD) simulation is a powerful computational tool for modeling various physicochemical properties, particularly solubility. MD simulations offer a detailed perspective on molecular interactions and dynamics, providing insights into the factors influencing solubility. This study aims to statistically examine the impact of ten MD-derived properties, along with the octanol-water partition coefficient (logP), one of the most influential experimental properties, on the aqueous solubility of drugs using Machine Learning (ML) techniques. To achieve this, a dataset comprising 211 drugs from diverse classes was compiled from the literature. These drugs were subjected to MD simulation, from which relevant properties were extracted and selected as features. Additionally, the corresponding logP values from previous studies were incorporated into the analysis. Through rigorous analysis, the properties with the most significant influence on solubility were identified and subsequently used as input features for four ensemble machine learning algorithms: Random Forest, Extra Trees, XGBoost, and Gradient Boosting. The results indicate that seven properties (logP, Solvent Accessible Surface Area (SASA), Coulombic_t, LJ, Estimated Solvation Free Energies (DGSolv), Root Mean Square Deviation (RMSD), and Average number of solvents in Solvation Shell (AvgShell)) are highly effective in predicting solubility, exhibiting performance comparable to predictive models based on structural features. The Gradient Boosting algorithm achieved the best performance, with a predictive R² of 0.87 and an RMSE of 0.537 on the test set. This research underscores the potential of integrating MD simulations with ML methodologies to improve the accuracy and efficiency of aqueous solubility predictions in drug development.
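
As a rough illustration of the modeling step described in the abstract, the following minimal sketch fits a gradient boosting regressor to seven features and reports R² and RMSE on a held-out test set. It is not the authors' pipeline: the data are synthetic random numbers generated only so the example runs, the weights used to build the target are arbitrary, and the 80/20 split, the default hyperparameters, and the helper names `make_synthetic_data` and `fit_and_score` are all assumptions made for illustration. Only the seven feature names and the choice of scikit-learn-style Gradient Boosting follow the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Feature names taken from the abstract; the values below are NOT real data.
FEATURES = ["logP", "SASA", "Coulombic_t", "LJ", "DGSolv", "RMSD", "AvgShell"]


def make_synthetic_data(n_samples=211, seed=0):
    """Generate placeholder features and a placeholder solubility target.

    Assumption: a noisy linear relation with arbitrary weights, chosen only
    so that the model has something to learn. It carries no chemical meaning.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, len(FEATURES)))
    weights = np.array([-1.0, -0.5, 0.3, 0.3, -0.4, 0.1, 0.2])  # arbitrary
    y = X @ weights + rng.normal(scale=0.3, size=n_samples)
    return X, y


def fit_and_score(X, y, seed=0):
    """Fit gradient boosting on a train split and score it on a test split."""
    # Assumption: an 80/20 random split; the abstract does not state the split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = GradientBoostingRegressor(random_state=seed)  # default settings
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    r2 = r2_score(y_test, pred)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    return model, r2, rmse


X, y = make_synthetic_data()
model, r2, rmse = fit_and_score(X, y)
print(f"test R^2 = {r2:.3f}, test RMSE = {rmse:.3f}")

# Relative importance the fitted model assigns to each (synthetic) feature.
for name, importance in zip(FEATURES, model.feature_importances_):
    print(f"{name:12s} {importance:.3f}")
```

The numbers printed by this sketch reflect the synthetic data only and say nothing about the reported results. With real MD-derived features in `X` and measured solubility values in `y`, the same `fit_and_score` call could in principle be reused, and swapping in another ensemble regressor would allow a side-by-side comparison of the kind the abstract describes.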