A Large-Scale Cryo-EM RNA Motif Dataset and Benchmark for Machine Learning-Based Structure Modeling
Journal: bioRxiv
Published Date: Feb 9, 2026
Abstract
Motivation: RNA molecules play critical roles in gene regulation, viral replication, and cellular control, with their functions tightly coupled to three-dimensional structure. Advances in cryogenic electron microscopy (cryo-EM) now enable RNA structure characterization across a broad resolution range. RNA secondary structural motifs, including hairpins, internal loops, and bulges, act as fundamental building blocks of RNA tertiary architecture and are key targets in RNA-focused therapeutic design. Despite this, most computational approaches for RNA structure prediction from cryo-EM density maps do not explicitly utilize secondary structural motifs as intermediate representations, largely due to the absence of large-scale, high-quality, and motif-resolved datasets suitable for machine learning.
Results: Here, we present a large, open-source dataset containing over 125,000 motif-resolved cryo-EM density maps paired with corresponding atomic structures, spanning 25 classes of RNA secondary structural motifs. The dataset covers resolutions from 1.5 Å to 34.0 Å, encompassing both near-atomic and low-resolution density maps relevant to RNA modeling. Each motif instance includes a segmented cryo-EM density map represented as a standardized 3D voxel grid, with atomic-level motif annotations propagated to voxel-level labels for RNA backbone, ribose sugar, and nucleobase components. Segmentation quality is validated via cross-correlation analysis, demonstrating strong agreement between motif-level density maps and atomic reference models. To illustrate the dataset's utility, high-resolution maps (1.5–2.8 Å) were used to train a machine learning classifier that distinguished five motif classes with a specificity of 0.948.
Availability and Implementation: Source code, implementation of the fully automated pipeline, and the benchmark datasets are publicly available on GitHub (https://github.com/DrDongSi/3DEM-RNA-Motif-Dataset) and Zenodo (https://zenodo.org/communities/3dem-rna-motif-dataset).
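The abstract reports a single specificity value for a five-class motif classifier but does not say how the per-class values were aggregated. As a minimal sketch, assuming a one-vs-rest definition of specificity (TN / (TN + FP)) and an unweighted macro average across classes, the metric could be computed from a confusion matrix as follows. The function names and the toy confusion matrix are hypothetical illustrations, not part of the authors' pipeline or results.

```python
from typing import List, Sequence


def per_class_specificity(confusion: Sequence[Sequence[int]]) -> List[float]:
    """One-vs-rest specificity, TN / (TN + FP), for each class.

    confusion[i][j] counts samples whose true class is i and whose
    predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    values = []
    for k in range(n):
        row_sum = sum(confusion[k])                       # truly class k
        col_sum = sum(confusion[i][k] for i in range(n))  # predicted as k
        tp = confusion[k][k]
        fp = col_sum - tp
        tn = total - row_sum - col_sum + tp
        negatives = tn + fp  # every sample whose true class is not k
        values.append(tn / negatives if negatives else float("nan"))
    return values


def macro_specificity(confusion: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-class specificities (assumed aggregation)."""
    values = per_class_specificity(confusion)
    return sum(values) / len(values)


# Hypothetical 3-class confusion matrix, for illustration only.
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]

if __name__ == "__main__":
    print(per_class_specificity(toy_confusion))
    print(macro_specificity(toy_confusion))
```

A macro average weights every motif class equally regardless of how many instances it has. A micro average, which pools TN and FP across classes before dividing, would weight classes by their size instead. The two can differ noticeably when class sizes are imbalanced.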