Unsupervised feature selection algorithm for multiclass cancer classification of gene expression RNA-Seq data.

Journal: Genomics
Published Date:

Abstract

This paper presents a Grouping Genetic Algorithm (GGA) to solve a maximally diverse grouping problem. The algorithm is applied to the classification of an unbalanced database of 801 gene expression RNA-Seq samples covering 5 types of cancer. Each sample comprises 20,531 genes. The GGA extracts several groups of genes that achieve high accuracy in multiclass classification. Accuracy was evaluated with an Extreme Learning Machine algorithm and was found to be slightly higher in balanced databases than in unbalanced ones. The final classification decision is made through a weighted majority vote among the groups of features. The proposed algorithm ultimately selects 49 genes, classifying samples with an average accuracy of 98.81% and a standard deviation of 0.0174.
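
The weighted majority vote mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `weighted_majority_vote`, the class labels, and the choice of weights (for example, each gene group's validation accuracy) are assumptions made here for illustration, since the abstract does not say how the weights are derived.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Combine one predicted label per gene group into a single decision.

    predictions: list of class labels, one per group of features.
    weights: list of non-negative floats, one per group (assumed here to
             reflect how much each group's classifier is trusted).
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")

    # Sum the weights of all groups voting for each label.
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight

    # Pick the label with the largest total weight; iterating over the
    # sorted labels makes ties resolve deterministically.
    return max(sorted(totals), key=lambda label: totals[label])


# Hypothetical example: three gene groups vote on one sample.
decision = weighted_majority_vote(
    ["type_A", "type_B", "type_B"],
    [0.99, 0.40, 0.45],
)
print(decision)
```

In this example one highly weighted group outvotes two weakly weighted groups, which is the behaviour that distinguishes a weighted vote from a simple majority.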

Authors

  • Pilar García-Díaz
    Department of Signal Theory and Communications, Polytechnic School, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain. Electronic address: pilar.garcia@uah.es.
  • Isabel Sánchez-Berriel
    Department of Computer and Systems Engineering, Higher School of Engineering and Technology, University of La Laguna, 38200 San Cristobal de La Laguna, S/C de Tenerife, Spain. Electronic address: isanchez@ull.edu.es.
  • Juan A Martínez-Rojas
    Department of Signal Theory and Communications, Polytechnic School, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain. Electronic address: juanan.martinez@uah.es.
  • Ana M Diez-Pascual
    Department of Analytical Chemistry, Physical Chemistry and Chemical Engineering, Faculty of Sciences, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain. Electronic address: am.diez@uah.es.