Machine learning for discovering missing or wrong protein function annotations: a comparison using updated benchmark datasets

Journal: BMC Bioinformatics

Published Date: Sep 23, 2019

Abstract

BACKGROUND: A massive amount of proteomic data is generated daily, yet annotating every sequence is costly and often infeasible. As a countermeasure, machine learning methods have been used to predict protein function annotations automatically. In particular, many studies have investigated hierarchical multi-label classification (HMC) methods that predict annotations organized in the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus train their models on outdated information. In this work, we provide an updated version of these datasets: by querying recent versions of the FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. We also evaluate whether the predictive models can discover new or wrong annotations by training them on the old data and evaluating their predictions against the most recent information.
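The evaluation idea in the last sentence of the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 0.5 threshold, the prediction scores, and the toy FunCat-style codes below are all assumptions made for illustration. The sketch takes scores from a model trained on old annotations, flags terms that are predicted but absent from the old labels (candidate missing annotations) and terms present in the old labels but not predicted (candidate wrong annotations), and then checks both against a newer annotation set.

```python
def ancestors(term, parents):
    """Return every ancestor of `term`; `parents` maps a term to its parent terms."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def close_upward(labels, parents):
    """Hierarchy constraint: an annotated term implies all of its ancestors."""
    closed = set(labels)
    for term in labels:
        closed |= ancestors(term, parents)
    return closed


def compare_with_update(scores, old_labels, new_labels, threshold=0.5):
    """Check candidate missing/wrong annotations against newer labels.

    `scores` are hypothetical outputs of a model trained on the old labels.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    predicted = {term for term, score in scores.items() if score >= threshold}
    candidate_missing = predicted - old_labels  # predicted, not in old data
    candidate_wrong = old_labels - predicted    # in old data, not predicted
    return {
        "confirmed_missing": candidate_missing & new_labels,
        "confirmed_wrong": candidate_wrong - new_labels,
    }


# Toy hierarchy with FunCat-style codes (illustrative only).
parents = {"01.01": ["01"], "01.01.03": ["01.01"]}

old_labels = close_upward({"01.01", "02"}, parents)  # older annotation set
new_labels = close_upward({"01.01.03"}, parents)     # updated annotation set

# Hypothetical scores for one protein.
scores = {"01": 0.9, "01.01": 0.8, "01.01.03": 0.7, "02": 0.1}

result = compare_with_update(scores, old_labels, new_labels)
print(result)
```

In this toy case the model proposes the more specific term `01.01.03`, which the updated labels contain, and it fails to support `02`, which the updated labels no longer contain. A real study would aggregate such comparisons over many proteins and thresholds rather than inspect a single example.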

Authors

Felipe Kenji Nakano

KU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, Kortrijk, 8500, Belgium. felipekenji.nakano@kuleuven.be

Mathias Lietaert

Howest University of Applied Sciences, Campus Brugge Station, Rijselstraat 5, Brugge, 8200, Belgium.

Celine Vens

Department of Computer Science, KU Leuven, Leuven, Belgium.

Keywords

Cluster Analysis; Eukaryota; Gene Ontology; Humans; Machine Learning; Molecular Sequence Annotation; Proteomics

External Resources

PubMed ID: 31547800
