Integrating multi-encoding sequence features via stacking ensemble learning for RNA m5C site prediction.

Journal: Nucleosides, nucleotides & nucleic acids

Abstract

RNA 5-methylcytosine (m5C) is an important epitranscriptomic modification involved in RNA stability, translation, and post-transcriptional regulation. Accurate identification of m5C sites remains challenging due to limited sequence representation and insufficient feature integration in existing computational methods. In this study, we propose a comprehensive machine learning framework that integrates six complementary sequence encoding schemes, including enhanced nucleic acid composition (ENAC), tri-nucleotide composition (TNC), composition of K-spaced nucleic acid pairs (CKSNAP), pseudo-electron-ion interaction potential (PseEIIP), one-hot encoding, and nucleotide chemical properties (NCP). Each encoding is paired with an optimal classifier, and a stacking ensemble strategy is employed to fuse the outputs of base classifiers. The model is trained using 5-fold cross-validation for base learners and 3-fold cross-validation for the meta-learner. Performance evaluation using multiple metrics demonstrates that the proposed approach achieves improved robustness and cross-dataset generalization, with an accuracy of 75.5%, MCC of 0.51, and PR-AUC of 0.82. These results indicate that the proposed fusion-based ensemble framework provides an effective and reliable solution for RNA m5C site prediction.
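
To make the encoding step more concrete, the following is a minimal sketch of three of the six schemes named above (one-hot, NCP, and TNC) in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the A/C/G/U column order, and the example sequence are hypothetical, and the NCP triples follow the commonly used convention (ring structure, functional group, hydrogen bonding), which the abstract does not spell out.

```python
from itertools import product

# Assumed alphabet and column order; the abstract does not specify either.
ALPHABET = "ACGU"

# Commonly used NCP convention (assumption):
# (purine ring, amino group, weak hydrogen bonding)
NCP_TABLE = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def one_hot(seq):
    """Encode each base as a 4-dimensional indicator vector, flattened."""
    out = []
    for base in seq:
        out.extend(1 if base == b else 0 for b in ALPHABET)
    return out


def ncp(seq):
    """Encode each base by its three chemical-property bits, flattened."""
    out = []
    for base in seq:
        out.extend(NCP_TABLE[base])
    return out


def tnc(seq):
    """Return the 64 tri-nucleotide frequencies, normalized by window count."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=3)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - 2  # number of overlapping 3-mers
    for i in range(total):
        counts[seq[i:i + 3]] += 1
    return [counts[k] / total for k in kmers]


# Hypothetical example sequence, not taken from any dataset.
example = "GAUCCGA"
features = {
    "one_hot": one_hot(example),
    "ncp": ncp(example),
    "tnc": tnc(example),
}
```

In a stacking setup such as the one described, each of these feature vectors would feed its own base classifier, and the base classifiers' out-of-fold predictions would form the input of the meta-learner. This sketch assumes sequences contain only A, C, G, and U; ambiguous bases such as N would need separate handling.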
