Feature engineering from meta-data for prediction of differentially expressed genes: An investigation of Mus musculus exposed to space-conditions.
Journal:
Computational Biology and Chemistry
Published Date:
Feb 6, 2024
Abstract
Transcription profiling is a key process for revealing the biological mechanisms that drive the response to exposure conditions or gene perturbations. In this work, we investigate whether genes that are differentially expressed under space conditions can be predicted from a diverse set of engineered features. To do this, we collected differentially expressed genes (DEGs) and non-differentially expressed genes (NDEGs) from Mus musculus experiments in the GeneLab database. We engineered a diverse set of features from factors reported in the literature to affect gene expression. An extreme gradient boosting (XGBoost) model was trained to predict whether a given gene would be differentially expressed at various levels of differential expression. On a separate holdout dataset, the model achieved an area under the receiver operating characteristic curve (AUC) of 0.90±0.07, averaged across the five selected percentages of the most and least differentially expressed genes. Subsequently, we investigated how feature selection affects prediction performance, both for individual features, using a correlation-based feature-selection procedure, and for groups of features, using a combination procedure. The feature selection confirmed some known drivers of adaptation to radiation and highlighted some new transcription factors and microRNAs (miRNAs). Finally, gene ontology (GO) analysis revealed biological processes whose expression patterns tend to be most suitable for this approach. This work highlights the potential of detecting differentially expressed genes with a machine learning (ML) approach, and provides some evidence that gene expression changes are captured by a diverse feature set not related to the condition under study.
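The evaluation summarized above (an AUC per percentage cutoff, then a mean and spread across cutoffs) can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that "most and least differentially expressed" means ranking genes by an absolute effect size and labeling the top p% as DEGs and the bottom p% as NDEGs, and that the ± value is a standard deviation across cutoffs. The function names `label_extremes`, `roc_auc`, and `mean_sd` are hypothetical, and the scores passed to `roc_auc` stand in for the output of any classifier, such as a gradient-boosted model.

```python
from statistics import mean, stdev


def label_extremes(abs_effects, percent):
    """Label the top `percent` of genes as 1 (DEG) and the bottom `percent`
    as 0 (NDEG), by absolute effect size; genes in between are dropped.
    Assumption: this is one plausible reading of the selection step."""
    if not 0 < percent <= 50:
        raise ValueError("percent must be in (0, 50]")
    n = max(1, int(len(abs_effects) * percent / 100))
    order = sorted(range(len(abs_effects)), key=lambda i: abs_effects[i])
    labels = {i: 0 for i in order[:n]}          # least differentially expressed
    labels.update({i: 1 for i in order[-n:]})   # most differentially expressed
    return labels


def roc_auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U statistic), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def mean_sd(aucs):
    """Mean and sample standard deviation of per-cutoff AUCs."""
    return mean(aucs), stdev(aucs)


# Toy check: two NDEGs and two DEGs with hypothetical classifier scores.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

In practice each cutoff would have its own labeled holdout set and model scores; `mean_sd` would then be applied to the resulting list of AUC values.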