Large-scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants.

Journal: Proteins
PMID:

Abstract

Recent advances in computing power and machine learning enable functional annotation of protein sequences and their transcript variants. Here, we present UniGOPred, an automated prediction system for Gene Ontology (GO) annotations, together with a database of GO term predictions for the proteomes of several organisms in the UniProt Knowledgebase (UniProtKB). UniGOPred provides function predictions for 514 molecular function (MF), 2909 biological process (BP), and 438 cellular component (CC) GO terms for each protein sequence. It covers nearly the whole functional spectrum of the Gene Ontology system and can predict both generic and specific GO terms. UniGOPred was run on the CAFA2 challenge target protein sequences and ranked among the top 10 best-performing methods in the molecular function category. In addition, UniGOPred outperformed the baseline BLAST classifier in all GO categories. UniGOPred predictions are also compared with UniProtKB/TrEMBL database annotations. Furthermore, we discuss the tool's ability to predict negatively associated GO terms, which define the functions that a protein does not possess. UniGOPred annotations were also validated through case studies: experimentally on PTEN protein variants and against the literature on CHD8 protein variants. The UniGOPred protein functional annotation system is available as an open access tool at http://cansyl.metu.edu.tr/UniGOPred.html.

Authors

  • Ahmet Sureyya Rifaioglu
    Department of Computer Engineering, Middle East Technical University, Ankara, 06800, Turkey.
  • Tunca Doğan
    European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK.
  • Ömer Sinan Saraç
    Department of Computer Engineering, Istanbul Technical University, İstanbul, 34467, Turkey.
  • Tulin Ersahin
    CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey.
  • Rabie Saidi
    European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK.
  • Mehmet Volkan Atalay
    Department of Computer Engineering, Middle East Technical University, Ankara, 06800, Turkey.
  • Maria Jesus Martin
    Protein Function Development Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom.
  • Rengul Cetin-Atalay
    CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey.