Genes sharing the protein family domain decrease the performance of classification with RNA-seq genomic signatures.

Journal: Biology Direct
Published Date:

Abstract

BACKGROUND: Our experience running various types of classification on the CAMDA neuroblastoma dataset has led us to conclude that the results are not always obvious and may differ depending on the type of analysis and the selection of genes used for classification. This paper aims to point out several factors that may influence downstream machine learning analysis: the type of primary analysis, the type of classifier, and the increased correlation between genes that share a protein domain. These factors influence the analysis directly, and the interplay between them may also be important. We compiled a gene-domain database and used it to examine the differences between genes that share a domain and the remaining genes in the datasets.

Authors

  • Anna Leśniewska
    Department of Computer Science, Poznan University of Technology, Piotrowo 2, Poznan, 60-965, Poland.
  • Joanna Zyprych-Walczak
    Department of Mathematical and Statistical Methods, Poznan University of Life Sciences, Poznan, 60-637, Poland.
  • Alicja Szabelska-Beręsewicz
    Department of Mathematical and Statistical Methods, Poznan University of Life Sciences, Poznan, 60-637, Poland.
  • Michal J Okoniewski
    Scientific IT Services, ETH Zurich, Weinbergstrasse 11, Zürich, 8092, Switzerland. michal.okoniewski@id.ethz.ch.