Predicting protein functions using positive-unlabeled ranking with ontology-based priors.

Journal: Bioinformatics (Oxford, England)
PMID:

Abstract

Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function prediction is a multilabel classification problem in which only positive samples are defined and a large number of annotations are unlabeled. Most existing methods assume that unlabeled protein function annotations are negatives, which induces the false negative issue: potential positive samples are treated as negatives during training. We introduce a novel approach named PU-GO, in which we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e. we minimize the classification risk of a classifier, with class priors obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets.
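
To make the idea of risk minimization with a class prior concrete, the following is a minimal sketch of a generic non-negative positive-unlabeled risk estimate for a single function class. It is not the PU-GO implementation: the sigmoid surrogate loss, the function names `sigmoid_loss` and `nn_pu_risk`, and the scalar `prior` argument are all assumptions made for illustration. How the prior would be derived from the Gene Ontology hierarchy is not stated in the abstract, so it is passed in as a plain number here.

```python
import numpy as np


def sigmoid_loss(scores, target):
    """Sigmoid surrogate loss in (0, 1); small when sign(scores) matches target (+1 or -1)."""
    return 1.0 / (1.0 + np.exp(target * scores))


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Generic non-negative PU risk estimate for one class (illustrative only).

    pos_scores: classifier scores for proteins annotated with the class.
    unl_scores: classifier scores for proteins without that annotation (unlabeled).
    prior:      assumed fraction of true positives for the class (hypothetical input).
    """
    # Risk of scoring known positives as negative, weighted by the class prior.
    risk_pos = prior * sigmoid_loss(pos_scores, +1).mean()
    # Risk on true negatives, estimated from unlabeled data after subtracting
    # the share of that loss that is attributable to hidden positives.
    risk_neg = sigmoid_loss(unl_scores, -1).mean() - prior * sigmoid_loss(pos_scores, -1).mean()
    # Clamp at zero so the negative-class estimate cannot go below zero.
    return risk_pos + max(0.0, risk_neg)


# Toy scores, invented for this example only.
example_risk = nn_pu_risk(np.array([2.0, 3.0]), np.array([-2.0, -1.0, 0.5]), prior=0.1)
```

In this sketch the unlabeled proteins are never treated as outright negatives: their loss is corrected by the prior-weighted loss on known positives, which is the general mechanism by which a positive-unlabeled formulation avoids the false negative issue described above.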

Authors

  • Fernando Zhapa-Camacho
    Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia.
  • Zhenwei Tang
    Department of Dermatology, Xiangya Hospital, Central South University, Changsha, Hunan Province, People's Republic of China.
  • Maxat Kulmanov
    Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia.
  • Robert Hoehndorf
    Computational Bioscience Research Center, King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, 23955-6900, Saudi Arabia. robert.hoehndorf@kaust.edu.sa.