NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity.

Journal: Computational biology and chemistry
Published Date:

Abstract

The Gene Ontology (GO) provides GO annotations (GOA) that associate gene products with GO terms summarizing their cellular, molecular and functional aspects in the context of biological pathways. The GO Consortium (GOC) relies on various quality assurance measures to ensure the correctness of annotations. Due to resource limitations, only a small portion of annotations are manually added or checked by GO curators, and a large portion of the available annotations are computationally inferred. While computationally inferred annotations provide greater coverage of known genes, they may also introduce annotation errors (noise) that could mislead the interpretation of gene functions and their roles in cellular and biological processes. In this paper, we investigate how to identify noisy annotations, a rarely addressed problem, and propose a novel approach called NoisyGOA. NoisyGOA first measures the taxonomic similarity between ontological terms using the GO hierarchy, and the semantic similarity between genes. Next, it leverages both similarities to predict noisy annotations. We compare NoisyGOA with alternative methods for identifying noisy annotations under different simulated noise scenarios and on archived GO annotations. NoisyGOA achieved higher accuracy than the alternative methods compared. These results demonstrate that both taxonomic similarity and semantic similarity contribute to the identification of noisy annotations. Our study shows that annotation errors are predictable and that removing noisy annotations improves the performance of gene function prediction. This study can prompt the community to study methods for removing inaccurate annotations, a critical step for annotating gene and pathway functions.
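
The two-step idea in the abstract (term-level taxonomic similarity, gene-level semantic similarity, then a combined score per annotation) can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the toy hierarchy, the gene and term identifiers, the Jaccard overlap of ancestor sets as taxonomic similarity, the best-match average as gene similarity, and the neighbour-weighted support score. It is not the NoisyGOA implementation.

```python
# Toy GO-like hierarchy: term -> set of parent terms (hypothetical identifiers).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "B1": {"B"},
}


def ancestors(term):
    """Return the term together with all of its ancestors in the hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_similarity(t1, t2):
    """Assumed taxonomic similarity: Jaccard overlap of the ancestor sets."""
    a1, a2 = ancestors(t1), ancestors(t2)
    return len(a1 & a2) / len(a1 | a2)


def gene_similarity(terms1, terms2):
    """Assumed semantic similarity: best-match average over both term sets."""
    if not terms1 or not terms2:
        return 0.0
    best1 = [max(term_similarity(s, t) for t in terms2) for s in terms1]
    best2 = [max(term_similarity(s, t) for t in terms1) for s in terms2]
    return (sum(best1) + sum(best2)) / (len(best1) + len(best2))


def annotation_scores(annotations, k=2):
    """Score each (gene, term) pair by its support among the k most similar genes.

    A low score means that semantically similar genes carry no taxonomically
    close term, so the annotation is a candidate noisy annotation.
    """
    scores = {}
    for gene, terms in annotations.items():
        others = [
            (gene_similarity(terms, annotations[g]), g)
            for g in annotations
            if g != gene
        ]
        neighbours = sorted(others, reverse=True)[:k]
        total = sum(sim for sim, _ in neighbours)
        for term in terms:
            support = sum(
                sim * max((term_similarity(term, t) for t in annotations[g]), default=0.0)
                for sim, g in neighbours
            )
            scores[(gene, term)] = support / total if total else 0.0
    return scores


# Hypothetical annotations; "B1" on gene "g3" is the planted noisy annotation.
ANNOTATIONS = {
    "g1": {"A1", "A2"},
    "g2": {"A1", "A2"},
    "g3": {"A1", "A2", "B1"},
    "g4": {"A1"},
}

scores = annotation_scores(ANNOTATIONS)
suspect = min(scores, key=scores.get)  # lowest-scoring (gene, term) pair
print(suspect, round(scores[suspect], 3))
```

In this toy setting the planted annotation receives the lowest score, because the genes most similar to g3 are annotated only with terms from the other branch of the hierarchy. A real evaluation would use the actual GO hierarchy and annotation files, and the similarity measures defined in the paper itself.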

Authors

  • Chang Lu
    College of Computer and Information Science, Southwest University, Chongqing 400715, China.
  • Jun Wang
    Department of Speech, Language, and Hearing Sciences and the Department of Neurology, The University of Texas at Austin, Austin, TX 78712, USA.
  • Zili Zhang
    College of Computer and Information Science, Southwest University, Chongqing 400715, China.
  • Pengyi Yang
    Epigenetics & Stem Cell Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, Durham, North Carolina, United States of America; Biostatistics Branch, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, Durham, North Carolina, United States of America.
  • Guoxian Yu
    College of Computer and Information Science, Southwest University, Chongqing 400715, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China.