Deepm5C: A deep-learning-based hybrid framework for identifying human RNA N5-methylcytosine sites using a stacking strategy.

Journal: Molecular therapy : the journal of the American Society of Gene Therapy
Published Date:

Abstract

As one of the most prevalent post-transcriptional epigenetic modifications, N5-methylcytosine (m5C) plays an essential role in various cellular processes and disease pathogenesis. It is therefore important to identify m5C modifications accurately in order to gain a deeper understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their models were developed using small training datasets, so their practical application to genome-wide detection is quite limited. To overcome these limitations, we propose Deepm5C, a bioinformatics method for identifying RNA m5C sites throughout the human genome. To develop Deepm5C, we constructed a novel benchmarking dataset and investigated a mixture of three conventional feature-encoding algorithms and a feature derived from word-embedding approaches. Four variants of deep-learning classifiers and four commonly used conventional classifiers were then trained with each of the four encodings, yielding 32 baseline models. A stacking strategy was then applied: the predicted outputs of the optimal baseline models were integrated and used to train a one-dimensional (1D) convolutional neural network. The resulting Deepm5C predictor achieved a Matthews correlation coefficient of 0.697 and an accuracy of 0.855 during cross-validation; the corresponding values on the independent test were 0.691 and 0.852. Overall, Deepm5C achieved more accurate and stable performance than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of the proposed hybrid framework. Deepm5C is expected to assist community-wide efforts in identifying putative m5C sites and in formulating novel testable biological hypotheses.
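
The arithmetic behind the 32 baseline models, the shape of the stacking input, and the two reported metrics can be illustrated with a short Python sketch. This is not the authors' code. The encoding and classifier labels are placeholders, since the abstract does not name them, and the helper names (`baseline_grid`, `stack_features`, `mcc`) are hypothetical. The meta-learner itself (a 1D convolutional neural network in the abstract) is not implemented here; the sketch only shows how baseline outputs might be arranged as its input, assuming each baseline model emits one probability per sample.

```python
import math
from itertools import product

# Placeholder labels: the abstract gives only the counts, not the names.
ENCODINGS = [f"encoding_{i}" for i in range(1, 5)]       # 3 conventional + 1 word-embedding
CLASSIFIERS = (
    [f"deep_{i}" for i in range(1, 5)]                   # 4 deep-learning variants
    + [f"conventional_{i}" for i in range(1, 5)]         # 4 conventional classifiers
)


def baseline_grid():
    """Return every (encoding, classifier) pair; each pair is one baseline model."""
    return list(product(ENCODINGS, CLASSIFIERS))


def stack_features(baseline_probs):
    """Arrange baseline outputs as meta-features for a stacking model.

    baseline_probs holds one list per selected baseline model, each containing
    that model's predicted probability for every sample (an assumption).
    The result has one row per sample and one column per baseline model.
    """
    return [list(row) for row in zip(*baseline_probs)]


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels coded as 1 (m5C) and 0 (non-m5C)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    print(len(baseline_grid()), "baseline models")
    counts = confusion_counts([1, 1, 0, 0], [1, 0, 0, 0])
    print("accuracy:", accuracy(*counts), "MCC:", mcc(*counts))
```

Because the Matthews correlation coefficient uses all four confusion-matrix cells, it is usually lower than accuracy on the same predictions, which is consistent with the reported pairs (0.697 vs. 0.855 and 0.691 vs. 0.852).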

Authors

  • Md Mehedi Hasan
    Nutrition and Clinical Services Division, International Center for Diarrheal Disease and Research, Bangladesh (icddr,b), Dhaka, Bangladesh.
  • Sho Tsukiyama
    Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan.
  • Jae Youl Cho
    Department of Integrative Biotechnology, and Biomedical Institute for Convergence at SKKU (BICS), Sungkyunkwan University, Suwon, Republic of Korea.
  • Hiroyuki Kurata
  • Md Ashad Alam
    Department of Biomedical Engineering, Tulane University, New Orleans, LA 70118, USA. Electronic address: malam@tulane.edu.
  • Xiaowen Liu
    School of Informatics and Computing, Indiana University-Purdue University Indianapolis, Indianapolis, Indiana 46202, United States.
  • Balachandran Manavalan
    Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea.
  • Hong-Wen Deng
    Center for Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA 70112, USA.