ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data
Journal: arXiv
Published Date: May 19, 2025
Abstract
Single-cell Assay for Transposase-Accessible Chromatin using sequencing
(scATAC-seq) has produced a vast repository of single-cell chromatin
accessibility data, offering a new perspective for deciphering regulatory
mechanisms. While foundation models have achieved significant success
in single-cell transcriptomics, there is currently no foundation model for
scATAC-seq that simultaneously supports high-quality zero-shot cell
identification and comprehensive multi-omics analysis. Key challenges lie in the
high dimensionality and sparsity of scATAC-seq data, as well as the lack of a
standardized schema for representing open chromatin regions (OCRs). Here, we
present ChromFound, a foundation model tailored for scATAC-seq. ChromFound
uses a hybrid architecture and genome-aware tokenization to capture
genome-wide long contexts and regulatory signals from dynamic chromatin
landscapes. Pretrained on 1.97 million cells from 30 tissues and 6 disease
conditions, ChromFound demonstrates broad applicability across 6 diverse tasks.
Notably, it achieves robust zero-shot performance in generating universal cell
representations and exhibits excellent transferability in cell type annotation
and cross-omics prediction. By uncovering enhancer-gene links undetected by
existing computational methods, ChromFound offers a promising framework for
understanding disease risk variants in the noncoding genome.