A large language model for predicting neurotoxic peptides and neurotoxins.

Journal: Protein science : a publication of the Protein Society
Published Date:

Abstract

The accurate prediction of neurotoxicity in peptides and proteins is essential for the safety evaluation of therapeutic proteins and genetically modified (GM) organisms. Existing tools, including our earlier method NTxPred, typically use a single predictive model for both neurotoxic peptides and proteins, despite their structural and functional differences. This lack of specialization may lead to suboptimal performance and limited generalizability. To address this, we developed NTxPred2, which provides separate, specialized models for predicting neurotoxic peptides and neurotoxins (proteins). Our curated datasets include 877 neurotoxic and 877 non-toxic peptides, and 775 neurotoxic and 775 non-toxic proteins. Certain residues, such as cysteine, are abundant in both neurotoxic peptides and neurotoxins, though to different degrees. Using composition and binary profiles, our machine-learning models achieved an area under the curve (AUC) of 0.97 for peptides and 0.85 for proteins, with the latter improving to 0.89 when evolutionary information was added. Models using protein embeddings reached an AUC of 0.96 for peptides and 0.94 for proteins, while protein language models achieved 0.98 (esm2-t30) and 0.91 (esm2-t6). All models were validated via five-fold cross-validation, and the final models were evaluated on an independent dataset. We further assessed the protein models on the peptide dataset and vice versa, which highlighted the necessity of separate models. The proposed models outperform existing methods on independent datasets that were not used for training. Our neurotoxicity prediction models will aid in the safety assessment of GM foods and therapeutic proteins by minimizing the need for animal testing. To support the scientific community, we developed a standalone software package and a web server, NTxPred2, for predicting and scanning neurotoxins (https://webs.iiitd.edu.in/raghava/ntxpred2/, https://github.com/raghavagps/ntxpred2/).
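
As an illustration of the composition and binary-profile features mentioned in the abstract, the following is a minimal sketch in plain Python. It is not the NTxPred2 implementation: the function names `amino_acid_composition` and `binary_profile`, the fixed window length, and the zero-padding rule are assumptions made for this example only.

```python
# Illustrative sketch only; names and padding rule are assumptions,
# not taken from the NTxPred2 code base.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the percentage of each standard residue in the sequence."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return {aa: 100.0 * residues.count(aa) / total for aa in AMINO_ACIDS}


def binary_profile(sequence, length):
    """One-hot encode the first `length` residues (20 values per position).

    Positions beyond the end of the sequence, and non-standard residues,
    are encoded as all zeros (an assumed padding convention).
    """
    seq = sequence.upper()
    profile = []
    for i in range(length):
        residue = seq[i] if i < len(seq) else None
        profile.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return profile


if __name__ == "__main__":
    peptide = "CCAC"  # toy sequence, not from the paper's datasets
    print(amino_acid_composition(peptide)["C"])
    print(len(binary_profile(peptide, 5)))
```

Feature vectors of this kind can be passed to any standard classifier and scored by AUC under five-fold cross-validation, as the abstract describes. The composition vector has a fixed length of 20 regardless of sequence length, whereas the binary profile preserves residue order over a fixed window.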

Authors

  • Anand Singh Rathore
    Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
  • Saloni Jain
    Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
  • Shubham Choudhury
    Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
  • Gajendra P S Raghava
    Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.