An integrated approach for key gene selection and cancer phenotype classification: Improving diagnosis and prediction.
Journal: Computers in Biology and Medicine
Published Date: Jul 5, 2025
Abstract
The identification of key features and reliable phenotype classification remain pivotal in cancer research, with direct implications for early diagnosis, prognosis, treatment optimization, and cost reduction in healthcare. This study introduces a hybrid model that integrates statistical and machine learning (ML) algorithms to enhance feature selection and improve classification accuracy for cancer phenotypes. Five well-known statistical methods (LIMMA, SAM, ANOVA, the Kruskal-Wallis (KW) test, and the t-test) are employed to identify significant features based on statistical decision markers. The dominant features identified across both binary and multi-class datasets are then used for cancer phenotype classification with various ML methods: linear discriminant analysis (LDA), logistic regression (LR), naive Bayes (NB), Gaussian process classification (GPC), k-nearest neighbors (KNN), artificial neural networks (ANN), support vector machines (SVM) with radial, polynomial, and linear kernels, and random forest (RF). The model's robustness is validated on eight distinct microarray gene expression datasets, combined with various resampling protocols. The results show consistent improvements over previous benchmarks in the literature, with the RF classifier performing better in binary classification tasks and the radial-kernel SVM (SVM-r) demonstrating superior performance in multi-class settings. Additionally, analysis of the bladder cancer dataset identified 13 key genes (MYH11, CCN1, FHL1, MYL9, EFEMP1, FILIP1L, RGS2, MATN2, CALD1, TNC, PALLD, ADAMTS9-AS2, and CELF2) with strong discriminatory power. These genes were further validated through enrichment in relevant GO terms and KEGG pathways, emphasizing their diagnostic and prognostic significance. Moreover, Gene-TF and Gene-miRNA network analyses highlighted critical regulators, including transcription factors (TFs) such as CYR61, SMAD4, SOX2, TP63, and AR, along with miRNAs such as hsa-let-7b-5p, hsa-miR-34a-5p, hsa-let-7a-5p, hsa-let-7c-5p, and hsa-miR-16-5p, underscoring the functional impact of the selected features.
In conclusion, the proposed approach effectively generates a streamlined set of optimal features, providing valuable biological insights and laying the groundwork for more accurate and effective tools in cancer diagnosis and prediction.
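The two-stage idea described in the abstract (rank genes with a statistical test, then classify on the top-ranked genes) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only one of the five named tests (a Welch t-test), substitutes a small hand-written k-nearest-neighbors classifier for the full set of ML methods, and runs on synthetic data rather than the microarray datasets. All function names (`welch_t`, `select_top_genes`, `knn_predict`, `make_toy_data`, `run_pipeline`) and parameter values are hypothetical, chosen for illustration.

```python
import random
import statistics
from math import sqrt


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(va / len(a) + vb / len(b))


def select_top_genes(X, y, k):
    """Rank genes by |t| between class 0 and class 1; return the top-k indices, sorted."""
    scores = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((abs(welch_t(g0, g1)), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])


def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training samples closest to x (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X_train, y_train)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)


def make_toy_data(n_per_class=30, n_genes=200, informative=(3, 17, 42), shift=3.0, seed=0):
    """Synthetic expression matrix (assumption): only `informative` genes differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for j in informative:
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def run_pipeline(seed=0, k_genes=3):
    """Split, select genes on the training split only, then classify the held-out split."""
    X, y = make_toy_data(seed=seed)
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.7 * len(idx))
    train, test = idx[:cut], idx[cut:]
    X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]

    # Selecting on the training split avoids leaking test labels into the ranking.
    selected = select_top_genes(X_tr, y_tr, k_genes)
    X_tr_sel = [[row[j] for j in selected] for row in X_tr]

    correct = sum(
        knn_predict(X_tr_sel, y_tr, [X[i][j] for j in selected]) == y[i] for i in test
    )
    return selected, correct / len(test)


if __name__ == "__main__":
    genes, accuracy = run_pipeline()
    print("selected gene indices:", genes)
    print("held-out accuracy:", accuracy)
```

Gene ranking is performed inside the training split so that the held-out accuracy is not inflated by selection bias, which matters when the number of genes far exceeds the number of samples, as is typical for microarray data.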