Heimdall: A Modular Framework for Tokenization in Single-Cell Foundation Models
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Foundation models trained on single-cell RNA-sequencing (scRNA-seq) data have rapidly become powerful tools for single-cell analysis. Their performance, however, depends critically on how cells are tokenized into model inputs – a design space that remains poorly understood. Here, we present Heimdall, a comprehensive framework and open-source toolkit for systematically evaluating tokenization strategies in single-cell foundation models (scFMs). Heimdall decomposes each scFM into modular components: a gene identity encoder (FG), an expression encoder (FE), and a “cell sentence” constructor (FC) with submodules (order, sequence, and reduce), enabling fine-grained control and attribution. Using a transformer trained from scratch, we evaluate tokenization strategies for cell type classification across challenging transfer learning settings – cross-tissue, cross-species, and spatial gene-panel shifts – and separately assess reverse perturbation prediction. Tokenization choices show minimal impact in-distribution but are decisive under distribution shift, with FG and order driving the largest gains and FE providing additional improvements. Heimdall further shows how existing strategies can be recombined to enhance generalization. By standardizing evaluation and providing an extensive library, Heimdall establishes a foundation for reproducible, systematic exploration of single-cell tokenization and accelerates the development of next-generation scFMs.
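To make the decomposition concrete, the following is a minimal, illustrative sketch of how a gene identity encoder (FG), an expression encoder (FE), and a cell sentence constructor (FC) with order, sequence, and reduce steps might fit together. It is not the Heimdall API: every function name, the toy embeddings, the binning scheme, and the choice of rank-by-expression ordering, truncation, and element-wise summation are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy components. None of these names come from the Heimdall
# toolkit; they only mirror the FG / FE / FC roles described in the abstract.


def gene_identity_encoder(gene_ids: Sequence[int], dim: int = 4) -> List[List[float]]:
    """FG stand-in: a deterministic toy embedding for each gene index."""
    return [[float((g + 1) * (k + 1) % 7) for k in range(dim)] for g in gene_ids]


def expression_encoder(
    values: Sequence[float], dim: int = 4, n_bins: int = 5, max_value: float = 10.0
) -> List[List[float]]:
    """FE stand-in: bin each expression value and repeat the bin index as a vector."""
    embeddings = []
    for v in values:
        bin_index = min(int(v / max_value * n_bins), n_bins - 1)
        embeddings.append([float(bin_index)] * dim)
    return embeddings


def order_by_expression(expr: Sequence[float]) -> List[int]:
    """FC 'order' step (assumed): expressed genes, highest expression first."""
    expressed = [i for i, v in enumerate(expr) if v > 0]
    return sorted(expressed, key=lambda i: (-expr[i], i))  # ties broken by gene index


def build_cell_sentence(
    expr: Sequence[float], max_len: int = 3, dim: int = 4
) -> Tuple[List[int], List[List[float]]]:
    """FC stand-in: order genes, truncate the sequence, then reduce the embeddings."""
    gene_ids = order_by_expression(expr)   # order
    gene_ids = gene_ids[:max_len]          # sequence: keep the first max_len genes
    gene_emb = gene_identity_encoder(gene_ids, dim)
    expr_emb = expression_encoder([expr[i] for i in gene_ids], dim)
    # reduce: element-wise sum of the gene and expression embeddings per token
    tokens = [[g + e for g, e in zip(gv, ev)] for gv, ev in zip(gene_emb, expr_emb)]
    return gene_ids, tokens


if __name__ == "__main__":
    cell = [0.0, 5.0, 2.0, 9.0, 0.0, 2.0]  # toy expression vector over six genes
    gene_ids, tokens = build_cell_sentence(cell)
    print(gene_ids)
    print(tokens)
```

Because each step is a separate function, one component can be swapped (for example, a different ordering rule) while the others stay fixed, which is the kind of controlled comparison the framework is described as enabling.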