Extended dipeptide composition framework for accurate identification of anticancer peptides.

Journal: Scientific reports
PMID:

Abstract

The identification of anticancer peptides (ACPs) is crucial, especially for the development of peptide-based cancer therapy. Classical models such as Split Amino Acid Composition (SAAC) and Pseudo Amino Acid Composition (PseAAC) do not incorporate enhanced feature representations, which can improve the predictive accuracy and efficiency of ACP identification. This research therefore aims to develop an advanced framework based on feature extraction, and to that end we propose an Extended Dipeptide Composition (EDPC) framework. EDPC extends the dipeptide composition by incorporating local sequence environment information and by adapting the CD-HIT framework to remove noise and redundancy. To assess its accuracy, we performed several experiments using four widely used machine learning (ML) algorithms: Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and K Nearest Neighbor (KNN). For comparison, we used accuracy, specificity, sensitivity, precision, recall, and F1-score as evaluation criteria. The reliability of the proposed framework was further evaluated using statistical significance tests. The proposed EDPC framework outperformed SAAC and PseAAC: the SVM model delivered the highest accuracy, 96.6%, along with significant improvements in specificity, sensitivity, precision, and F1-score across multiple datasets. EDPC achieves this higher classification performance by incorporating enhanced feature representations together with local and global sequence profiles. The framework can handle noisy and duplicate features while supporting a wide range of feature representations. Finally, the proposed framework can be used in clinical applications where ACP identification is essential. Future work will include extending the evaluation to a larger variety of datasets, incorporating tertiary structural information, and using deep learning techniques to improve the proposed EDPC.
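
For orientation, the following is a minimal sketch of the classical dipeptide composition that EDPC builds on, together with the evaluation criteria named in the abstract. It is not the authors' EDPC implementation: the abstract does not specify how the local sequence environment or the CD-HIT step is computed, so neither is shown here. The function names `dipeptide_composition` and `binary_metrics`, and the choice to skip pairs containing non-standard residues, are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order that defines the feature layout.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional classical dipeptide composition vector.

    Each entry is the fraction of adjacent residue pairs in the sequence
    that equal the corresponding dipeptide. Pairs containing non-standard
    residues are skipped (an assumption of this sketch).
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def binary_metrics(y_true, y_pred):
    """Compute the abstract's evaluation criteria for labels in {0, 1}.

    Label 1 is treated as the positive (ACP) class. Sensitivity and recall
    are the same quantity, so both keys carry the same value.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "specificity": ratio(tn, tn + fp),
        "sensitivity": sensitivity,
        "recall": sensitivity,
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

The resulting vectors could be passed as feature rows to any of the four classifiers listed in the abstract, with `binary_metrics` applied to the held-out predictions.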

Authors

  • Faizan Ullah
    Department of Computer Science, Bacha Khan University, Charsadda, 24420, Pakistan.
  • Abdu Salam
    Department of Computer Science, Abdul Wali Khan University, Mardan, 23200, Pakistan.
  • Muhammad Nadeem
    Department of Computer Science, Faculty of Computing and Information Technology, International Islamic University, Islamabad, Punjab, Pakistan.
  • Farhan Amin
    School of Computer Science and Engineering, Yeungnam University, Gyeongsan, 38541, Korea. farhanamin10@hotmail.com.
  • Hussain AlSalman
    Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia.
  • Mohammad Abrar
    Faculty of Computer Studies, Arab Open University, Muscat, Oman.
  • Taha Alfakih
    Centre of Smart Robotics Research (CS2R), King Saud University, Riyadh 11543, Saudi Arabia.