Normalization and selecting non-differentially expressed genes improve machine learning modelling of cross-platform transcriptomic data
Journal: arXiv
Published Date: Jan 24, 2025
Abstract
Normalization is a critical step in quantitative analyses of biological
processes. Recent work shows that cross-platform integration and normalization
enable machine learning (ML) training on RNA microarray and RNA-seq data, but
those studies did not use independent datasets. It therefore remains unclear
how to improve ML modelling performance on independent microarray- and
RNA-seq-based datasets. Inspired by the housekeeping genes that are commonly used in
experimental biology, this study tests the hypothesis that non-differentially
expressed genes (NDEG) may improve normalization of transcriptomic data and
subsequently the cross-platform performance of ML models. Microarray and
RNA-seq datasets from the TCGA breast cancer cohort were used as independent training
and test datasets, respectively, to classify the molecular subtypes of breast
cancer. NDEG (p>0.85) and differentially expressed genes (DEG, p<0.05) were
selected based on ANOVA p values and used for subsequent data
normalization and classification, respectively. Models trained on data
from one platform were tested on the other platform. Our data show
that NDEG and DEG selection can effectively improve model
classification performance. Normalization methods based on parametric
statistical analysis were inferior to those based on nonparametric statistics.
In this study, the LOG_QN and LOG_QNZ normalization methods combined with the
neural network classification model seem to achieve better performance.
Therefore, NDEG-based normalization appears useful for cross-platform testing
on completely independent datasets. However, more studies are required to
examine whether NDEG-based normalization can improve ML classification
performance in other datasets and other omic data types.
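
The gene-selection step described above (NDEG at p>0.85, DEG at p<0.05, both from ANOVA p values) can be illustrated with a minimal sketch. It assumes a one-way ANOVA per gene across subtype labels, computed with `scipy.stats.f_oneway`; the abstract does not specify the exact ANOVA setup, so the function name `select_genes_by_anova`, the samples-by-genes matrix layout, and the synthetic data are all hypothetical and are not the authors' implementation.

```python
import numpy as np
from scipy.stats import f_oneway


def select_genes_by_anova(expr, labels, ndeg_p=0.85, deg_p=0.05):
    """Split genes into NDEG and DEG by per-gene one-way ANOVA p values.

    expr   : array of shape (n_samples, n_genes), expression values
    labels : array of shape (n_samples,), class (e.g. subtype) per sample
    Thresholds follow the abstract: NDEG if p > 0.85, DEG if p < 0.05.
    (Hypothetical helper; the ANOVA design is an assumption.)
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    # One sample-by-gene block per class; f_oneway tests each column (gene).
    groups = [expr[labels == c] for c in np.unique(labels)]
    _, pvals = f_oneway(*groups, axis=0)
    ndeg_idx = np.flatnonzero(pvals > ndeg_p)  # candidates for normalization
    deg_idx = np.flatnonzero(pvals < deg_p)    # candidates for classification
    return ndeg_idx, deg_idx, pvals


# Synthetic demonstration data (not from the study): 3 classes, 200 genes,
# with the first 10 genes shifted in class "B" so they are differential.
rng = np.random.default_rng(0)
labels = np.repeat(["A", "B", "C"], 20)
expr = rng.normal(size=(60, 200))
expr[labels == "B", :10] += 3.0

ndeg_idx, deg_idx, pvals = select_genes_by_anova(expr, labels)
print(f"{len(ndeg_idx)} NDEG, {len(deg_idx)} DEG out of {expr.shape[1]} genes")
```

In this sketch the NDEG indices would feed a normalization step and the DEG indices a classifier, mirroring the split of roles stated in the abstract. Genes with p values between the two thresholds are used for neither.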