PERC: a suite of software tools for the curation of cryoEM data with application to simulation, modelling and machine learning
Journal: arXiv
Published Date: Mar 17, 2025
Abstract
Ease of access to data, tools and models expedites scientific research. In
structural biology there are now numerous open repositories of experimental and
simulated datasets. Being able to access and use these easily is crucial
if researchers are to make optimal use of their research effort. The
tools presented here are useful for collating existing public cryoEM datasets
and/or creating new synthetic cryoEM datasets to aid the development of novel
data processing and interpretation algorithms. In recent years, structural
biology has seen the development of a multitude of machine-learning-based
algorithms that aid numerous steps in the processing and reconstruction of
experimental datasets, and the use of these approaches has become widespread.
Developing such techniques in structural biology requires access to large
datasets, which can be cumbersome to curate and unwieldy to use. In this
paper we present a suite of Python software packages which we collectively
refer to as PERC (profet, EMPIARreader and CAKED). These are designed to reduce
the burden which data curation places upon structural biology research. The
protein structure fetcher (profet) package allows users to conveniently
download and cleave sequences or structures from the Protein Data Bank or
AlphaFold databases. EMPIARreader allows lazy loading of Electron Microscopy
Public Image Archive datasets in a machine-learning compatible structure. The
Class Aggregator for Key Electron-microscopy Data (CAKED) package is designed
to seamlessly facilitate the training of machine learning models on electron
microscopy data, including electron-cryo-microscopy-specific data augmentation
and labelling. These packages may be utilised independently or as building
blocks in workflows. All are available in open source repositories and designed
to be easily extensible to facilitate more advanced workflows if required.
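
The lazy loading mentioned above can be illustrated with a minimal, generic sketch. It is not the EMPIARreader API: the class name `LazyDataset`, the `fetch` callable and the example file paths are all hypothetical, and a stand-in fetcher replaces any real network access. The idea shown is only that the file listing is held up front while file contents are retrieved on first access.

```python
from typing import Callable, Dict, List


class LazyDataset:
    """Hypothetical sketch: hold a file index, fetch contents only on access."""

    def __init__(self, paths: List[str], fetch: Callable[[str], bytes]) -> None:
        self._paths = list(paths)          # cheap index of available files
        self._fetch = fetch                # called only when an item is requested
        self._cache: Dict[str, bytes] = {}

    def __len__(self) -> int:
        return len(self._paths)

    def __getitem__(self, index: int) -> bytes:
        path = self._paths[index]
        if path not in self._cache:        # first access triggers the fetch
            self._cache[path] = self._fetch(path)
        return self._cache[path]


# Stand-in fetcher (assumption): records each request instead of downloading.
fetched: List[str] = []


def fake_fetch(path: str) -> bytes:
    fetched.append(path)
    return path.encode("utf-8")


dataset = LazyDataset(["micrographs/a.mrc", "micrographs/b.mrc"], fake_fetch)
first = dataset[0]   # fetches a.mrc
again = dataset[0]   # served from the cache, no second fetch
```

Because `__len__` and `__getitem__` are the two methods that map-style dataset interfaces in common machine-learning frameworks expect, a wrapper of this shape could in principle be passed to a data loader, with the fetch cost paid only for the items a training run actually touches.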