Genes sharing the protein family domain decrease the performance of classification with RNA-seq genomic signatures.
Journal:
Biology Direct
Published Date:
Feb 21, 2018
Abstract
BACKGROUND: Our experience running various types of classification on the CAMDA neuroblastoma dataset has led us to conclude that the results are not always obvious and may differ depending on the type of analysis and the selection of genes used for classification. This paper aims to point out several factors that may influence downstream machine learning analysis. In particular, these factors are the type of primary analysis, the type of classifier, and the increased correlation between genes sharing a protein domain. They influence the analysis directly, but the interplay between them may also be important. We compiled a gene-domain database and used it to examine the differences between genes that share a domain and the rest of the genes in the datasets.
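
The comparison between domain-sharing genes and the remaining genes can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `mean_abs_correlation_by_domain`, the domain labels, and the synthetic expression matrix are all assumptions made for illustration, and Pearson correlation is assumed as the similarity measure. The sketch splits all gene pairs into those that share at least one domain annotation and those that do not, then reports the mean absolute correlation of each group.

```python
from itertools import combinations

import numpy as np


def mean_abs_correlation_by_domain(expr, gene_domains):
    """Compare correlation of gene pairs that share a domain vs. pairs that do not.

    expr         : array of shape (n_samples, n_genes), expression values
    gene_domains : list of sets, one per gene, holding its domain labels
    Returns (mean |r| over domain-sharing pairs, mean |r| over other pairs).
    """
    # Pearson correlation between genes (columns of expr).
    corr = np.corrcoef(expr, rowvar=False)
    shared, unshared = [], []
    for i, j in combinations(range(len(gene_domains)), 2):
        # A pair "shares a domain" if the two annotation sets intersect.
        bucket = shared if gene_domains[i] & gene_domains[j] else unshared
        bucket.append(abs(corr[i, j]))
    return float(np.mean(shared)), float(np.mean(unshared))


# Synthetic toy data (an assumption, not taken from the CAMDA dataset):
# genes 0 and 1 follow a common signal, genes 2 and 3 are independent noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=50)
expr = np.column_stack([
    signal + 0.1 * rng.normal(size=50),
    signal + 0.1 * rng.normal(size=50),
    rng.normal(size=50),
    rng.normal(size=50),
])
# Hypothetical domain labels: only genes 0 and 1 share a domain.
gene_domains = [{"DOMAIN_A"}, {"DOMAIN_A"}, {"DOMAIN_B"}, set()]

shared_mean, unshared_mean = mean_abs_correlation_by_domain(expr, gene_domains)
print(f"domain-sharing pairs: {shared_mean:.2f}, other pairs: {unshared_mean:.2f}")
```

On real data, the gene-to-domain mapping would come from a compiled gene-domain database, and a higher mean correlation among domain-sharing pairs would point to redundancy among features passed to a classifier.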