scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics
Journal: arXiv
Published Date: Jun 2, 2025
Abstract
Modern single-cell datasets now comprise hundreds of millions of cells,
presenting significant challenges for training deep learning models that
require shuffled, memory-efficient data loading. While the AnnData format is
the community standard for storing single-cell datasets, existing data loading
solutions for AnnData are often inadequate: some require loading all data into
memory, others convert to dense formats that increase storage demands, and many
are hampered by slow random disk access. We present scDataset, a PyTorch
IterableDataset that operates directly on one or more AnnData files without
format conversion. Its core innovation is a combination of block
sampling and batched fetching, which together balance randomness and I/O
efficiency. On the Tahoe 100M dataset, scDataset achieves up to a 48$\times$
speed-up over AnnLoader, a 27$\times$ speed-up over HuggingFace Datasets, and
an 18$\times$ speed-up over BioNeMo in single-core settings. These advances
democratize large-scale single-cell model training for the broader research
community.
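
To illustrate how block sampling and batched fetching can work together, the following is a minimal sketch in plain NumPy. It is not scDataset's actual API, which the abstract does not describe: the function names and the parameters `block_size` and `fetch_factor` are hypothetical, and an in-memory array stands in for an on-disk AnnData matrix. The assumption is that indices are shuffled in contiguous blocks so that disk reads stay mostly sequential, that several minibatches' worth of indices are read in one sorted request, and that the fetched rows are then reshuffled in memory to restore randomness within each fetch.

```python
import numpy as np


def block_shuffled_indices(n_items, block_size, rng):
    """Shuffle contiguous blocks of indices instead of individual indices."""
    starts = np.arange(0, n_items, block_size)
    rng.shuffle(starts)  # randomize block order only; rows inside a block stay adjacent
    return np.concatenate(
        [np.arange(s, min(s + block_size, n_items)) for s in starts]
    )


def iterate_minibatches(data, batch_size, block_size, fetch_factor, seed=0):
    """Yield (indices, rows) minibatches using block sampling and batched fetching.

    `data` is any object supporting len() and integer-array indexing;
    here it is an illustrative stand-in for a disk-backed matrix.
    """
    rng = np.random.default_rng(seed)
    order = block_shuffled_indices(len(data), block_size, rng)
    fetch_size = batch_size * fetch_factor  # rows read per I/O request

    for start in range(0, len(order), fetch_size):
        # Sort the requested indices so the read touches the store in ascending order.
        chunk = np.sort(order[start:start + fetch_size])
        fetched = data[chunk]  # one batched read covering several minibatches

        # Reshuffle in memory so minibatches mix rows from different blocks.
        perm = rng.permutation(len(chunk))
        chunk, fetched = chunk[perm], fetched[perm]

        for b in range(0, len(chunk), batch_size):
            yield chunk[b:b + batch_size], fetched[b:b + batch_size]
```

In this sketch, a larger `block_size` makes reads more sequential at the cost of less randomness, while a larger `fetch_factor` mixes more blocks into each minibatch at the cost of more memory per fetch.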