Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters
Journal:
bioRxiv
Published Date:
Jan 1, 2025
Abstract
Foundation models have transformed natural language processing and computer vision, yet their potential in single-cell biology, particularly for complex diseases such as cancer, remains underexplored. We present Tahoe-x1 (Tx1), a family of perturbation-trained single-cell foundation models with up to 3 billion parameters. Tx1 is pretrained on large-scale single-cell transcriptomic datasets, including the Tahoe-100M perturbation compendium, and fine-tuned for cancer-relevant tasks. Through architectural optimizations, data loader refinements, and efficient training strategies, Tx1 achieves 3–30× higher compute efficiency than prior implementations of cell-state models. Tx1 jointly learns representations of genes, cells, and compounds using a masked-expression generative objective that incorporates a drug token, enabling flexible adaptation to diverse downstream applications. We evaluate Tx1 on four disease-relevant benchmarks: (1) prediction of overall and context-specific gene essentiality, (2) identification of genes contributing to the hallmarks of cancer, (3) cell-type classification, and (4) prediction of perturbation responses in held-out cellular contexts. Tx1 achieves state-of-the-art performance on all four tasks. We release pretrained checkpoints, training code, and evaluation workflows to accelerate the development of perturbation-trained single-cell foundation models for applications in precision oncology and beyond.
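To make the masked-expression objective with a drug token more concrete, the following is a minimal sketch of how one training example and its loss could be set up. It is illustrative only and is not taken from the Tx1 code: the sentinel `MASK_VALUE`, the helper names `build_masked_example` and `masked_mse`, the 50% masking fraction, and the use of mean squared error are all assumptions, since the abstract does not specify the tokenization, masking ratio, or loss.

```python
import numpy as np

MASK_VALUE = -1.0  # hypothetical sentinel marking a hidden expression value


def build_masked_example(gene_ids, expression, drug_id, mask_fraction, rng):
    """Pair a cell's expression profile with a drug token and hide some values.

    Returns the model input (with masked entries replaced by MASK_VALUE),
    the full target vector, and a boolean mask of the hidden positions.
    """
    gene_ids = np.asarray(gene_ids)
    expression = np.asarray(expression, dtype=float)
    n_genes = expression.shape[0]

    # Choose which genes to hide; always hide at least one.
    n_masked = max(1, int(round(mask_fraction * n_genes)))
    masked_idx = rng.choice(n_genes, size=n_masked, replace=False)
    is_masked = np.zeros(n_genes, dtype=bool)
    is_masked[masked_idx] = True

    model_input = expression.copy()
    model_input[is_masked] = MASK_VALUE

    return {
        "drug_id": int(drug_id),          # conditioning token for the compound
        "gene_ids": gene_ids,             # one token per gene
        "input_expression": model_input,  # what the model would see
        "target_expression": expression,  # what it should reconstruct
        "is_masked": is_masked,
    }


def masked_mse(predicted, target, is_masked):
    """Mean squared error computed over the masked positions only."""
    diff = np.asarray(predicted)[is_masked] - np.asarray(target)[is_masked]
    return float(np.mean(diff ** 2))


# Toy usage: four genes, one hypothetical drug id, half of the values hidden.
rng = np.random.default_rng(0)
example = build_masked_example(
    gene_ids=[10, 11, 12, 13],
    expression=[0.0, 2.5, 1.0, 4.0],
    drug_id=7,
    mask_fraction=0.5,
    rng=rng,
)
# A stand-in "prediction" that ignores the input and always outputs zeros.
baseline_loss = masked_mse(np.zeros(4), example["target_expression"], example["is_masked"])
```

In a real model the zero baseline would be replaced by the network's output given the drug token, the gene tokens, and the partially masked expression values; restricting the loss to masked positions is what makes the objective generative over the hidden genes.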