Gene-Family Encoding Boosts Domain-Adapted Single-Cell Language Models
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Transformer-based single-cell foundation models often rely on ranked-gene (RG) sequences, in which genes are ordered by expression. Neighboring genes in such sequences are frequently not functionally related, which weakens next-token learning and the structure of the learned embeddings. Here, we introduce gene-family (GF) encoding, in which expressed genes are grouped into functionally defined families and ranked by expression within each family. Using 100,000 gastric-cancer (GC) cells, we domain-adapted 8-billion-parameter Llama and Qwen backbones with either RG or GF sentences and benchmarked zero-shot performance on embedding-based and generative tasks. GF models outperformed RG models on both task types and with both backbones. We then scaled GF-Llama to 1.3 million GC cells to obtain GF-Llama-GC and used it for two applications: resolving fine-grained cellular heterogeneity and discovering cell populations associated with disease progression. GF-Llama-GC revealed immune-cell subclusters that standard expression-based analyses did not resolve. When in-silico cell removal/transplantation was applied to a chemotherapy responder/non-responder scRNA-seq dataset, GF-Llama-GC highlighted not only epithelial cells but also neutrophils as key cell types associated with chemotherapy response.
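To make the difference between the two encodings concrete, the following is a minimal Python sketch of how an RG sentence and a GF sentence might be built for a single cell. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy expression values, the toy gene-to-family mapping, the `UNASSIGNED` bucket for unmapped genes, and the alphabetical ordering of families are all hypothetical choices, since the abstract does not specify them.

```python
from collections import defaultdict


def ranked_gene_sentence(expression):
    """RG encoding: expressed genes ordered by descending expression."""
    expressed = {g: v for g, v in expression.items() if v > 0}
    # Ties are broken by gene name so the output is deterministic.
    return sorted(expressed, key=lambda g: (-expressed[g], g))


def gene_family_sentence(expression, gene_to_family, unknown="UNASSIGNED"):
    """GF encoding: group expressed genes by family, rank within each family."""
    expressed = {g: v for g, v in expression.items() if v > 0}
    families = defaultdict(list)
    for gene in expressed:
        # Assumption: genes without a family go into one catch-all bucket.
        families[gene_to_family.get(gene, unknown)].append(gene)
    sentence = []
    # Assumption: families are emitted alphabetically; the abstract does not
    # say how families are ordered relative to each other.
    for family in sorted(families):
        ranked = sorted(families[family], key=lambda g: (-expressed[g], g))
        sentence.extend(ranked)
    return sentence


# Toy cell with made-up expression values (ACTB is not expressed here).
cell = {"CD3E": 5.0, "KRT18": 9.0, "CD3D": 7.0, "KRT8": 2.0, "ACTB": 0.0}
# Toy family mapping, for illustration only.
family_map = {"CD3E": "CD3", "CD3D": "CD3", "KRT18": "Keratin", "KRT8": "Keratin"}

rg = ranked_gene_sentence(cell)
gf = gene_family_sentence(cell, family_map)
print("RG:", " ".join(rg))
print("GF:", " ".join(gf))
```

In the RG sentence the keratin gene KRT18 is followed by the unrelated CD3D, whereas in the GF sentence each gene is adjacent to members of its own family, which is the property the abstract argues benefits next-token learning.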