Incorporating time as a third dimension in transcriptomic analysis using machine learning and explainable AI.

Journal: Computational biology and chemistry
PMID:

Abstract

Transcriptomic data analysis entails the measurement of RNA transcript (gene expression product) abundance in a cell or a cell population at a single point in time. In other words, transcriptomics as it is currently practiced is two-dimensional (2DTA). Gene expression profiling by 2DTA has proven invaluable in furthering our understanding of numerous biological processes in health and disease. That said, shortcomings including technical variability, small sample size, differential rates of transcript decay, and the lack of linearity between transcript abundance and functionality or the formation of functional proteins limit the interpretive utility and generalizability of transcriptomic data. The utility of 2DTA may also be constrained by its reliance on RNA extracts obtained at a single time point. Much like judging a movie by a single frame, 2DTA can only provide a snapshot of the transcriptome at the time of RNA extraction. Whether this perceived "temporality" problem is real and whether it has any bearing on transcriptomic data interpretation have yet to be addressed. To investigate this problem, we examined 25 publicly available datasets relating to MCF-7 cells, in which RNA extracts obtained at 12 or 48 hours post-culture were subjected to transcriptomic analysis. The individual datasets were downloaded and compiled into two separate datasets (MCF-7 U12hr and MCF-7 U48hr). To compare the two compiled datasets, three machine learning approaches (decision trees (DT), random forests (RF), and Extreme Gradient Boosting (XGBoost)) were used as classifiers to search for genes with distinct expression patterns between the two groups. SHapley Additive exPlanations (SHAP), an explainable AI method, was used to interpret the predictions of the DT, RF, and XGBoost models. The coefficient of determination (DC), mean absolute error (MAE), and mean squared error (MSE) were used to evaluate the models.
The results show that the two datasets exhibited clearly distinct gene expression patterns. The XGBoost model performed better than the DT or RF models, with MSE, MAE, and DC values of 0.00028, 0.00028, and 0.95778, respectively. These observations suggest that time, as a third dimension, can impact transcriptomic data interpretation and that machine learning and explainable AI are useful tools in resolving the temporality problem in transcriptomics.
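
The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical label encoding (0 for the 12-hour group, 1 for the 48-hour group) and made-up prediction values; the function name `evaluation_metrics` is illustrative only, and nothing here reproduces the authors' pipeline or data.

```python
from typing import Dict, Sequence


def evaluation_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MSE, MAE, and the coefficient of determination.

    Hypothetical helper for illustration; not taken from the study.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n                               # mean squared error
    mae = sum(abs(e) for e in errors) / n          # mean absolute error

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # Coefficient of determination; undefined when all true values are identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mse": mse, "mae": mae, "dc": r2}


# Assumed encoding: 0 = 12-hour sample, 1 = 48-hour sample (illustrative values).
labels = [0, 0, 1, 1]
scores = [0.1, 0.0, 0.9, 1.0]
metrics = evaluation_metrics(labels, scores)
print(metrics)
```

When predictions are hard 0/1 class labels, every error is 0 or ±1, so its square equals its absolute value and MSE coincides with MAE. This may be one reason two such metrics can take the same value, although the abstract does not say how the models' outputs were scored.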

Authors

  • Zubaida Said Ameen
    Operational Research Centre in Healthcare, Near East University, TRNC Mersin 10, Nicosia, 99138, Turkey.
  • Auwalu Saleh Mubarak
    Operational Research Centre in Healthcare, Near East University, TRNC Mersin 10, Nicosia, 99138, Turkey.
  • Mohamed Hamad
    Department of Medical Laboratory Sciences, College of Health Sciences, University of Sharjah, UAE; Research Institute of Medical and Health Sciences, University of Sharjah, UAE.
  • Rifat Hamoudi
  • Sherlyn Jemimah
    Department of Biomedical Engineering, Khalifa University, PO Box 127788, Abu Dhabi, United Arab Emirates.
  • Dilber Uzun Ozsahin
    Near East University, Nicosia/TRNC, Mersin-10, 99138, Turkey.
  • Mawieh Hamad
    Department of Medical Laboratory Sciences, College of Health Sciences, University of Sharjah, UAE; Research Institute of Medical and Health Sciences, University of Sharjah, UAE. Electronic address: mabdelhaq@sharjah.ac.ae.