AI methods and biologically informed data curation enable accurate RNA m5C prediction

Journal: bioRxiv
Published Date:

Abstract

5-methylcytosine (m5C) RNA modifications influence nearly every aspect of RNA metabolism, but their transcriptome-wide detection is limited by costly, error-prone assays. To bridge this experimental gap, a wave of AI tools now predicts putative m5C sites in silico. However, most existing approaches prioritize architectural complexity while neglecting data quality, so their reported gains mainly reflect artifacts inherited from noisy datasets. We inverted this paradigm by constructing a high-confidence, methyltransferase-specific catalog of m5C sites, removing artifacts that confound existing resources. Using this curated corpus, we trained three different models (Bi-GRU, CNN, Transformer), for the first time in a multiclass setting, to distinguish writer-specific m5C sites from unmethylated cytosines. All models converged to similar, near-optimal performance (AUPRC > 0.97), and a biologically informed analysis revealed that most errors clustered in unmethylated sites that mimic true positives. Augmenting the training set with these hard-to-predict negatives, mined from millions of unmodified cytosines, forced the models to exploit more nuanced features such as RNA secondary structure and subtle sequence cues. This sharply reduced transcriptome-wide false-positive predictions, and the predicted methylated transcripts showed strong concordance with known methyltransferase biology. Explainable AI techniques also showed that our models capture how sequence mutations disrupt m5C sites, underscoring their potential to prioritize disease-relevant variants. Overall, our findings indicate that AI models can be decisive levers for reliable m5C identification only if fed with curated data and validated through biologically informed computational analysis.
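
The hard-negative augmentation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name mine_hard_negatives, the site IDs, the score threshold, and the toy scores are all hypothetical, and the sketch assumes only that a trained model has already assigned each known unmethylated cytosine a predicted probability of being methylated. Sites that the model scores highly despite being unmethylated are the "hard" negatives that would be added back to the training set.

```python
from typing import Dict, List


def mine_hard_negatives(
    neg_scores: Dict[str, float],
    threshold: float = 0.5,
    max_sites: int = 1000,
) -> List[str]:
    """Return IDs of unmethylated cytosines that the model scores as methylated.

    neg_scores maps a (hypothetical) site ID to the model's predicted
    probability that the site is methylated. Every site in neg_scores is
    assumed to be a known unmethylated cytosine, so a score at or above
    the threshold is a false positive, i.e. a hard negative.
    """
    false_positives = [
        (site, score) for site, score in neg_scores.items() if score >= threshold
    ]
    # Most confidently wrong first; ties are broken by site ID for determinism.
    false_positives.sort(key=lambda item: (-item[1], item[0]))
    return [site for site, _ in false_positives[:max_sites]]


# Toy scores, invented for illustration only (not data from the study).
toy_scores = {
    "chr1:100": 0.03,
    "chr1:250": 0.91,
    "chr2:40": 0.62,
    "chr3:7": 0.48,
}
hard_negatives = mine_hard_negatives(toy_scores, threshold=0.5, max_sites=2)
print(hard_negatives)
```

In a real setting the selected sites would be appended to the negative class and the model retrained; the threshold and the cap on mined sites are design choices that the abstract does not specify.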

Authors

  • Emanuele Saitto; Elena Casiraghi; Alberto Paccanaro; Giorgio Valentini