Large-scale protein function prediction using heterogeneous ensembles.

Journal: F1000Research

Abstract

Heterogeneous ensembles are an effective approach in scenarios where the ideal data type and/or individual predictor are unclear for a given problem. These ensembles have shown promise for protein function prediction (PFP), but their ability to improve PFP at a large scale is unclear. The overall goal of this study is to critically assess this ability for a variety of heterogeneous ensemble methods across a multitude of functional terms, proteins and organisms. Our results show that these methods, especially Stacking using Logistic Regression, indeed produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred (https://github.com/GauravPandeyLab/LargeGOPred).

Authors

  • Linhua Wang
    Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA.
  • Jeffrey Law
    Genetics, Bioinformatics, and Computational Biology Ph.D. Program, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA.
  • Shiv D Kale
    Biocomplexity Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA.
  • T M Murali
    Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA.
  • Gaurav Pandey
    Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Graduate School of Biomedical Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA. Electronic address: gaurav.pandey@mssm.edu.