Exploiting MEDLINE for gene molecular function prediction via NMF based multi-label classification.

Journal: Journal of biomedical informatics
Published Date:

Abstract

The Gene Ontology (GO) provides a representation of terms and categories used to describe genes and their molecular functions, cellular components and biological processes. GO has become the standard for describing the functions of specific genes in different model organisms. GO annotation, the tagging of genes with GO terms, has mostly been a manual and time-consuming curation process. Although many automated approaches have been proposed for annotation, few have utilized knowledge available in the literature. In this manuscript, we describe the development and evaluation of a predictive system that automatically assigns molecular functions (GO terms) to genes using the biomedical literature. Because a gene can be associated with multiple molecular functions, we posed GO molecular function annotation as a multi-label classification problem with several classes. We used non-negative matrix factorization (NMF) for feature reduction and then classified the genes in the reduced feature space. To address the multi-label aspect of the data, we used the binary-relevance method, which trains one independent binary classifier per GO term. Although we experimented with several classifiers, binary relevance combined with a K-nearest neighbor (KNN) classifier performed best. On the UniProtKB/Swiss-Prot dataset, this combination achieved the best F1-measure, 0.84.
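The pipeline outlined in the abstract (NMF for feature reduction, then binary relevance with KNN) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy gene-by-term count matrix, the three label columns, the multiplicative-update NMF, the rank of 2, k = 3, and all function names are assumptions chosen for illustration. As a further simplification, the held-out genes are factorized together with the training genes rather than projected afterwards.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def nmf(V, rank, iters=500, seed=0, eps=1e-9):
    """Approximate non-negative V (n x m) as W (n x rank) times H (rank x m)
    using multiplicative updates for the squared reconstruction error."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return W, H

def knn_vote(train_X, train_y, x, k):
    """Binary KNN: majority vote among the k nearest training points."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    nearest = sorted(range(len(train_X)), key=lambda i: sqdist(train_X[i], x))[:k]
    return 1 if 2 * sum(train_y[i] for i in nearest) > k else 0

def binary_relevance_knn(train_X, train_Y, x, k=3):
    """Binary relevance: one independent binary KNN decision per label."""
    n_labels = len(train_Y[0])
    return [knn_vote(train_X, [y[l] for y in train_Y], x, k)
            for l in range(n_labels)]

# Hypothetical toy data: rows are genes, columns are term counts taken from
# abstracts that mention the gene. Genes 0-3 and genes 4-7 use different terms.
V = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [3, 3, 2, 0, 0, 0],
    [2, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 3],
    [0, 0, 0, 1, 3, 2],
    [0, 0, 0, 2, 3, 3],
    [0, 0, 0, 1, 2, 2],
]
# Hypothetical label matrix: one column per GO term; a gene may carry several.
Y = [[1, 1, 0]] * 4 + [[0, 1, 1]] * 4

W, H = nmf(V, rank=2)            # rows of W are the reduced gene features
train_idx, test_idx = [0, 1, 2, 4, 5, 6], [3, 7]
train_X = [W[i] for i in train_idx]
train_Y = [Y[i] for i in train_idx]
predictions = [binary_relevance_knn(train_X, train_Y, W[i]) for i in test_idx]
print(predictions)
```

Each held-out gene receives one 0/1 decision per label, so a gene can be assigned several GO terms at once. In practice, a library implementation of NMF and KNN would replace the hand-written helpers.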

Authors

  • Samah Jamal Fodeh
    Department of Emergency Medicine, Yale Center of Medical Informatics, Suite 264F, Yale University School of Medicine, New Haven, CT, 06519-1315, USA. samah.fodeh@yale.edu.
  • Aditya Tiwari
    University of Massachusetts Amherst, United States.