HitAnno: Atlas-level cell type annotation based on scATAC-seq data via a hierarchical language model
Journal: bioRxiv
Published Date: Mar 12, 2026
Abstract
The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) has emerged as a core technology for dissecting cellular epigenomic heterogeneity and gene regulatory programs. With the emergence of atlas-level scATAC-seq datasets, cell type annotation increasingly faces challenges arising from unprecedented data scale and increased cell-type diversity, which together place stringent demands on model reliability and robustness. Here, we introduce HitAnno, a hierarchical language model capable of accurate and scalable cell type annotation in atlas-level scATAC-seq data. Leveraging selected cell-type-specific peaks to construct cell sentences, HitAnno employs a two-level attention mechanism that captures accessibility profiles hierarchically. Extensive evaluations show that HitAnno robustly annotates both major and rare cell types across multiple settings, including intra-dataset, cross-donor and inter-dataset annotation. The hierarchical attention mechanisms of the model reveal co-accessibility patterns among peaks and dependencies across higher-order peak sets, making the annotation process interpretable. Trained on a 31-cell-type human atlas, HitAnno can directly annotate new query datasets without retraining and is accessible through an online interface. Our model also identifies heterogeneous subgroups within mixed-labeled cells from unseen datasets, demonstrating its potential to assist researchers in refining existing cell atlases.
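As a rough illustration of the cell-sentence idea described above, the sketch below groups each cell's accessible peaks by the cell-type-specific peak set they belong to, giving a two-level structure (peaks within a set, sets within a cell) of the kind a hierarchical attention model could consume. The grouping scheme, the function name `build_cell_sentences`, and the toy data are assumptions made for illustration only; they are not taken from HitAnno's implementation, and peak selection itself is not shown.

```python
def build_cell_sentences(accessibility, peak_sets):
    """Build one 'sentence' per cell from a binary accessibility matrix.

    accessibility: list of rows, one per cell; each row holds 0/1 values,
                   one per peak (hypothetical toy input).
    peak_sets:     dict mapping a peak-set name (e.g. a cell type) to the
                   column indices of its selected peaks (assumed given).

    Returns a list of sentences. Each sentence is a list of
    (set_name, open_peak_indices) pairs, one pair per peak set.
    """
    sentences = []
    for cell in accessibility:
        sentence = []
        for set_name, peak_indices in peak_sets.items():
            # Lower level: the peaks of this set that are open in this cell.
            open_peaks = [p for p in peak_indices if cell[p] > 0]
            # Upper level: the sentence is the ordered sequence of peak sets.
            sentence.append((set_name, open_peaks))
        sentences.append(sentence)
    return sentences


# Toy data: 2 cells x 6 peaks, with two hypothetical peak sets.
toy_matrix = [
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
]
toy_peak_sets = {"T cell": [0, 1, 2], "B cell": [3, 4, 5]}

sentences = build_cell_sentences(toy_matrix, toy_peak_sets)
for i, sentence in enumerate(sentences):
    print(f"cell {i}: {sentence}")
```

In a setup like this, one attention level would operate over the peaks inside each set and a second over the sequence of sets, which is one plausible reading of "two-level attention"; the actual architecture may differ.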