EnDM-CPP: A Multi-view Explainable Framework Based on Deep Learning and Machine Learning for Identifying Cell-Penetrating Peptides with Transformers and Analyzing Sequence Information.

Journal: Interdisciplinary sciences, computational life sciences
Published Date:

Abstract

Cell-penetrating peptides (CPPs) are crucial carriers for drug delivery. Because synthesizing new CPPs in the laboratory is time- and resource-consuming, computational methods that predict potential CPPs can accelerate the development of CPP-based therapies. This study proposes EnDM-CPP, which combines machine learning algorithms (SVM and CatBoost) with convolutional neural networks (CNN and TextCNN). For dataset construction, three previous CPP benchmark datasets (CPPsite 2.0, MLCPP 2.0, and CPP924) are merged to improve diversity and reduce homology. For feature generation, two language model-based features obtained from the Transformer architecture, ProtT5 and ESM-2, are fed to CNN and TextCNN. Additionally, sequence features such as CPRS, Hybrid PseAAC, and KSC are input to SVM and CatBoost. A Logistic Regression (LR) model is then built on the outputs of the individual predictors to make the final decision. The experimental results indicate that the fused ProtT5 and ESM-2 features contribute significantly to CPP prediction and that the employed features and models complement one another when combined. On an independent test dataset, EnDM-CPP achieved an accuracy of 0.9495 and a Matthews correlation coefficient of 0.9008, improvements of 2.23%-9.48% and 4.32%-19.02%, respectively, over other state-of-the-art methods. Code and data are available at https://github.com/tudou1231/EnDM-CPP.git .
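
The final decision step described in the abstract, an LR model built on the outputs of the base predictors, follows the general pattern of stacking. The sketch below illustrates that pattern only and is not the authors' implementation. The base-predictor probabilities are synthetic stand-ins for the SVM, CatBoost, CNN, and TextCNN outputs. The function names (`fit_meta_lr`, `meta_predict`, `accuracy_and_mcc`, `toy_base_outputs`), the learning rate, and the epoch count are all assumptions made for illustration. Accuracy and the Matthews correlation coefficient are computed with their standard confusion-matrix definitions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_meta_lr(X, y, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-learner by batch gradient descent.

    X holds one row per peptide; each column is one base predictor's
    probability that the peptide is a CPP. Hyperparameters are assumptions.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def meta_predict(X, w, b, threshold=0.5):
    """Turn meta-learner probabilities into 0/1 decisions."""
    return [
        int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold)
        for xi in X
    ]


def accuracy_and_mcc(y_true, y_pred):
    """Accuracy and Matthews correlation coefficient from the confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


def toy_base_outputs(n_per_class, rng):
    """Synthetic stand-in for four base predictors' probabilities (not real data)."""
    X, y = [], []
    for label, (lo, hi) in ((1, (0.6, 0.95)), (0, (0.05, 0.4))):
        for _ in range(n_per_class):
            X.append([rng.uniform(lo, hi) for _ in range(4)])
            y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = toy_base_outputs(20, rng)
X_test, y_test = toy_base_outputs(10, rng)

w, b = fit_meta_lr(X_train, y_train)
acc, mcc = accuracy_and_mcc(y_test, meta_predict(X_test, w, b))
print(f"accuracy={acc:.4f}, MCC={mcc:.4f}")
```

In a real pipeline, the meta-learner would typically be trained on out-of-fold predictions from the base models so that it does not see probabilities produced on their own training data. The abstract does not say how EnDM-CPP handles this.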

Authors

  • Lun Zhu
    School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou, 213164, China.
  • Zehua Chen
    College of Big Data, Taiyuan University of Technology, JinZhong, Shanxi Province, 030600, PR China.
  • Sen Yang
    Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China.