Large-scale protein function prediction using heterogeneous ensembles.

Journal: F1000Research

Published Date: Sep 28, 2018

Abstract

Heterogeneous ensembles are an effective approach when the ideal data type and/or individual predictor for a given problem is unclear. These ensembles have shown promise for protein function prediction (PFP), but it is not known whether they can improve PFP at a large scale. The overall goal of this study is to critically assess this ability for a variety of heterogeneous ensemble methods across a multitude of functional terms, proteins and organisms. Our results show that these methods, especially Stacking using Logistic Regression, indeed produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred (https://github.com/GauravPandeyLab/LargeGOPred).
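
For readers unfamiliar with stacking, the general idea can be illustrated with a minimal, self-contained sketch. This is not the LargeGOPred implementation: the two base predictors (a nearest-centroid scorer and a k-nearest-neighbour scorer), the synthetic two-feature data, and every function name below are assumptions chosen for illustration. Each synthetic label stands in for the annotation of a protein to a single Gene Ontology term. Dissimilar base predictors are trained, their out-of-fold scores become the inputs of a logistic regression meta-learner, and that meta-learner produces the final prediction.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fit_centroid(X, y):
    """Hypothetical base predictor 1: relative closeness to the positive centroid."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    c_pos = mean([x for x, t in zip(X, y) if t == 1])
    c_neg = mean([x for x, t in zip(X, y) if t == 0])

    def score(x):
        d_pos, d_neg = math.dist(x, c_pos), math.dist(x, c_neg)
        return d_neg / (d_pos + d_neg + 1e-12)
    return score


def fit_knn(X, y, k=5):
    """Hypothetical base predictor 2: fraction of positives among the k nearest neighbours."""
    def score(x):
        nearest = sorted(zip(X, y), key=lambda pair: math.dist(x, pair[0]))[:k]
        return sum(t for _, t in nearest) / len(nearest)
    return score


BASE_LEARNERS = [fit_centroid, fit_knn]  # deliberately dissimilar predictors


def fit_logistic(Z, y, lr=0.5, epochs=1000):
    """Logistic regression by batch gradient descent; the last weight is the bias."""
    w = [0.0] * (len(Z[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for z, t in zip(Z, y):
            p = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + w[-1])
            for j, zj in enumerate(z):
                grad[j] += (p - t) * zj
            grad[-1] += p - t
        w = [wi - lr * g / len(Z) for wi, g in zip(w, grad)]
    return w


def fit_stacking(X, y, n_folds=5):
    """Train the meta-learner on out-of-fold base scores, then refit bases on all data."""
    n = len(X)
    Z = [None] * n
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        scorers = [fit(X_tr, y_tr) for fit in BASE_LEARNERS]
        for i in range(fold, n, n_folds):  # held-out examples of this fold
            Z[i] = [s(X[i]) for s in scorers]
    w = fit_logistic(Z, y)
    final_scorers = [fit(X, y) for fit in BASE_LEARNERS]

    def predict_proba(x):
        z = [s(x) for s in final_scorers]
        return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + w[-1])
    return predict_proba


rng = random.Random(0)  # fixed seed so the toy data is reproducible


def sample(n):
    """Synthetic data: two Gaussian clusters, one per class."""
    data = []
    for i in range(n):
        label = i % 2
        centre = 1.5 if label == 1 else -1.5
        data.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    rng.shuffle(data)
    return [x for x, _ in data], [t for _, t in data]


X_train, y_train = sample(200)
X_test, y_test = sample(100)
model = fit_stacking(X_train, y_train)
accuracy = sum((model(x) >= 0.5) == (t == 1)
               for x, t in zip(X_test, y_test)) / len(X_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Training the meta-learner only on out-of-fold scores keeps it from rewarding base predictors that have merely memorized their training examples, which is the usual motivation for the cross-validated form of stacking.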

Authors

Linhua Wang

Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA.

Jeffrey Law

Genetics, Bioinformatics, and Computational Biology Ph.D. Program, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA.

Shiv D Kale

Biocomplexity Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA.

T M Murali

Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA.

Gaurav Pandey

Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Graduate School of Biomedical Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA. Electronic address: gaurav.pandey@mssm.edu.

Keywords

Bacterial Proteins; Gene Ontology; Logistic Models; Machine Learning

External Resources

PubMed ID: 30450194
