What does it take to learn the rules of RNA base pairing? A lot less than you may think.
Journal: Communications Biology
Published Date: Mar 26, 2026
Abstract
Amidst the fast-developing trend of RNA large language models with millions of parameters, we asked what would be minimally required to rediscover the rules of RNA canonical base pairing that define secondary structure, namely the Watson-Crick-Franklin A:U and G:C base pairs and the wobble G:U base pair. Here, we conclude that it does not require much at all. It does not require knowing secondary structures, it does not require aligning the sequences, and it does not require many parameters. We selected a probabilistic model (a stochastic context-free grammar, or SCFG) with a total of just 21 parameters that can describe arbitrary pairwise interactions, including but not restricted to those of RNA base pairing. Using standard deep learning techniques, we estimate its parameters by implementing the generative process in an automatic differentiation (autodiff) framework and applying stochastic gradient descent (SGD). We define and minimize a loss function that does not use any structural or alignment information. Trained on as few as fifty RNA sequences, the model recovers the specific rules of RNA base pairing after only a few iterations of SGD. Crucially, the sole inputs are RNA sequences. When optimizing for sequences corresponding to structured RNAs, SGD also yields the rules of RNA base-pair aggregation into helices. In sharp contrast, when trained on shuffled sequences, the system optimizes by avoiding base pairing altogether. Trained on messenger RNAs, it reveals interactions that are different from those of structural RNAs, and specific to each mRNA. We demonstrate that our approach generalizes across diverse RNA families by testing on 1094 sequences from 22 structurally distinct RNA families. Our results show that the emergence of canonical RNA base pairing can be attributed to sequence-level signals that are robust and detectable even without labeled structures or alignments, and with very few parameters.
Autodiff algorithms for probabilistic models, such as, but not restricted to, SCFGs, have significant potential, as they allow these models to be incorporated into end-to-end RNA deep learning methods for discerning transcripts of different functionalities.
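As a rough illustration of the kind of sequence-only loss described in the abstract, the sketch below computes the likelihood of an RNA sequence under a toy SCFG with the inside algorithm and turns it into a negative log-likelihood loss. The grammar (S → a S | a S b S | ε), its 23 probabilities (20 free parameters), and all function names are assumptions made for illustration; this is not the article's 21-parameter grammar or its loss function. The code is plain Python and only evaluates the loss; in an autodiff framework the same recursion would be differentiated with respect to the logits and minimized with SGD.

```python
import math
from itertools import product

ALPHABET = "ACGU"
# All 16 ordered nucleotide pairs, e.g. "GC" means G paired with a later C.
PAIRS = [a + b for a, b in product(ALPHABET, repeat=2)]


def softmax(logits):
    """Map unconstrained logits to probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def make_params(trans_logits, single_logits, pair_logits):
    """Build toy-grammar probabilities from logits (hypothetical layout)."""
    t_unpaired, t_pair, t_end = softmax(trans_logits)
    return {
        "t_unpaired": t_unpaired,  # S -> a S
        "t_pair": t_pair,          # S -> a S b S
        "t_end": t_end,            # S -> epsilon
        "single": dict(zip(ALPHABET, softmax(single_logits))),
        "pair": dict(zip(PAIRS, softmax(pair_logits))),
    }


def inside(seq, p):
    """Return P(seq) under the toy grammar, summed over all parses.

    table[i][j] is the probability that S generates seq[i:j].
    Probabilities are not kept in log space, so this is only suitable
    for short sequences.
    """
    n = len(seq)
    table = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][i] = p["t_end"]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # seq[i] is unpaired, the rest comes from S.
            total = p["t_unpaired"] * p["single"][seq[i]] * table[i + 1][j]
            # seq[i] pairs with seq[k]; S fills the inside and the remainder.
            for k in range(i + 1, j):
                total += (p["t_pair"] * p["pair"][seq[i] + seq[k]]
                          * table[i + 1][k] * table[k + 1][j])
            table[i][j] = total
    return table[0][n]


def loss(seqs, p):
    """Mean negative log-likelihood; uses sequences only, no structures."""
    return -sum(math.log(inside(s, p)) for s in seqs) / len(seqs)


# Usage: evaluate the loss at a uniform starting point.
uniform = make_params([0.0] * 3, [0.0] * 4, [0.0] * 16)
example_loss = loss(["GGGAAACCC", "ACGU"], uniform)
```

Because the loss depends on the parameters only through sums and products, raising the logit of a pair such as G:C increases the likelihood of sequences in which those nucleotides can be paired, which is the signal a gradient-based optimizer would follow.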