RNA3DB: A structurally-dissimilar dataset split for training and benchmarking deep learning models for RNA structure prediction.

Journal: Journal of Molecular Biology
Published Date:

Abstract

With advances in protein structure prediction thanks to deep learning models like AlphaFold, RNA structure prediction has recently received increased attention from deep learning researchers. RNAs introduce substantial challenges due to the sparser availability and lower structural diversity of experimentally resolved RNA structures in comparison to protein structures. These challenges are often poorly addressed by the existing literature, much of which reports inflated performance due to the use of training and testing sets with significant structural overlap. Further, the most recent Critical Assessment of Structure Prediction (CASP15) has shown that deep learning models for RNA structure are currently outperformed by traditional methods. In this paper we present RNA3DB, a dataset of structured RNAs, derived from the Protein Data Bank (PDB), that is designed for training and benchmarking deep learning models. The RNA3DB method arranges the RNA 3D chains into distinct groups (Components) that are non-redundant with regard to both sequence and structure, providing a robust way of dividing training, validation, and testing sets. Any split of these structurally-dissimilar Components is guaranteed to produce test and validation sets that are distinct by sequence and structure from the training set. We provide the RNA3DB dataset, a particular train/test split of the RNA3DB Components (in an approximate 70/30 ratio) that will be updated periodically. We also provide the RNA3DB methodology along with the source code, with the goal of creating a reproducible and customizable tool for producing structurally-dissimilar dataset splits for structural RNAs.
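The idea of splitting by Components can be illustrated with a minimal sketch. This is not the RNA3DB implementation: it assumes that pairs of similar chains (by sequence or structure) have already been identified by some external tool, and it treats a Component as a connected component of the resulting similarity graph. The chain identifiers, the function names `connected_components` and `split_components`, and the greedy assignment rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def connected_components(chains, similar_pairs):
    """Group chains so that any two chains linked by a similarity
    pair (directly or transitively) land in the same group.
    Assumption: similar_pairs was computed elsewhere."""
    parent = {c: c for c in chains}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = defaultdict(list)
    for c in chains:
        groups[find(c)].append(c)
    # Deterministic order: largest group first, ties broken by contents.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


def split_components(components, train_fraction=0.7):
    """Assign whole components to train or test (greedy, largest first).
    A component is never divided, so no similar pair can straddle the split."""
    total = sum(len(c) for c in components)
    train, test = [], []
    for comp in components:
        if len(train) + len(comp) <= train_fraction * total:
            train.extend(comp)
        else:
            test.extend(comp)
    return train, test


# Hypothetical chain identifiers and similarity pairs, for illustration only.
chains = [f"chain_{i}" for i in range(10)]
similar_pairs = [("chain_0", "chain_1"), ("chain_1", "chain_2"),
                 ("chain_3", "chain_4"), ("chain_5", "chain_6")]

components = connected_components(chains, similar_pairs)
train, test = split_components(components)
print(len(components), len(train), len(test))
```

Because assignment happens at the level of whole groups, the resulting ratio is only approximately the requested one, which is consistent with the "approximate 70/30" wording above.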

Authors

  • Marcell Szikszai
    Department of Computer Science & Software Engineering, The University of Western Australia, Perth, WA 6009, Australia.
  • Marcin Magnus
    Department of Molecular and Cellular Biology, Harvard University, Cambridge, 02138, MA, USA.
  • Siddhant Sanghi
    Department of Systems Biology, Columbia University, New York 10027, NY, USA; College of Biological Sciences, UC Davis, Davis 95616, CA, USA.
  • Sachin Kadyan
    Department of Systems Biology, Columbia University, New York 10027, NY, USA.
  • Nazim Bouatta
    Laboratory of Systems Pharmacology, Program in Therapeutic Science, Harvard Medical School, Boston, MA, USA. nazim_bouatta@hms.harvard.edu.
  • Elena Rivas
    Department of Molecular and Cellular Biology, Harvard University, Cambridge, 02138, MA, USA.