GNNs and ensemble models enhance the prediction of new sRNA-mRNA interactions in unseen conditions.

Journal: BMC bioinformatics
Published Date:

Abstract

Bacterial small RNAs (sRNAs) are pivotal in post-transcriptional regulation, affecting functions like virulence, metabolism, and gene expression by binding specific mRNA targets. Identifying these targets is crucial to understanding sRNA regulation across species. Despite advancements in high-throughput (HT) experimental methods, they remain technically challenging and are limited to detecting sRNA-target interactions under specific environmental conditions. Therefore, computational approaches, especially machine learning (ML), are essential for identifying strong candidates for biological validation. In this paper, we hypothesize that ML models trained on large-scale interaction data from specific conditions can accurately predict new interactions in unseen conditions within the same bacterial strain. To test this, we developed models from two families: (1) graph neural networks (GNNs), including GraphRNA and kGraphRNA, that learn transformed representations of interacting sRNA-mRNA pairs via graph relationships, and (2) decision forests, sInterRF (Random Forest) and sInterXGB (XGBoost), which use various interaction features for prediction. We also proposed Summation Ensemble Models (SEM) that combine scores from multiple models. Across three seen-to-unseen conditions evaluations, our models -particularly kGraphRNA- significantly improved the area under the ROC curve (AUC) and Precision-Recall curve (PR-AUC) compared to sRNARFTarget, CopraRNA, and RNAup. The SEM model combining GraphRNA and CopraRNA outperformed CopraRNA alone on a low-throughput (LT) interactions test set (HT-to-LT evaluation). Beyond enhanced performance, our models enable target prediction for species-specific sRNAs, a capability lacking in some existing tools. Furthermore, GNN models remove the dependency on external tools like RNAplex or RNAup to compute hybridization duplex or energy features, enhancing scalability and runtime efficiency. While this study focuses on E. coli K12 MG1655 interactions, our methods are fully adaptable to predict interactions in other bacterial strains, given sufficient data for training. Our comprehensive feature importance analysis revealed the complexity of sRNA-mRNA interactions across environmental conditions, underscoring the significance of RNA sequence composition and duplex structure characteristics, like base pairing and energy factors; findings that align with biological evidence from previous studies. As HT experiments expand sRNA-target interaction data across conditions in various bacteria, our ML methods with features analysis offer promising advances in sRNA-target prediction and deeper insights into sRNA regulatory mechanisms across diverse species.

Authors

  • Shani Cohen
    Department of Software & Information Systems Engineering, Faculty of Engineering, Ben-Gurion University of the Negev, 8410501, Beer-Sheva, Israel.
  • Lior Rokach
  • Isana Veksler-Lublinsky
    Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel.