Gene-Family Encoding Boosts Domain-Adapted Single-Cell Language Models

Journal: bioRxiv

Published Date: Jan 1, 2025

Abstract

Transformer-based single-cell foundation models often rely on ranked-gene (RG) sequences, in which genes are ordered by expression. Neighboring genes in such sequences are frequently not functionally related, which weakens next-token learning and the structure of the learned embeddings. Here, we introduce gene-family (GF) encoding, in which expressed genes are grouped into functionally defined families and ranked within each family. Using 100,000 gastric-cancer (GC) cells, we domain-adapted 8-billion-parameter Llama and Qwen backbones with either RG or GF sentences and benchmarked zero-shot performance on embedding-based and generative tasks. GF models outperformed RG models on both task types and with both backbones. We then scaled GF-Llama to 1.3 million GC cells to obtain GF-Llama-GC and used it for two applications: resolving fine-grained cellular heterogeneity and discovering cell populations associated with disease progression. GF-Llama-GC revealed immune-cell subclusters not resolved by standard expression-based analyses. When in-silico cell removal/transplantation was applied to a chemotherapy responder/non-responder scRNA-seq dataset, GF-Llama-GC highlighted not only epithelial cells but also neutrophils as key cells associated with chemotherapy response.
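The difference between the two encodings can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy gene-to-family mapping, the alphabetical ordering of families, and the `UNASSIGNED` bucket for unmapped genes are all assumptions made for illustration, since the abstract specifies only that expressed genes are grouped into families and ranked within each family.

```python
from collections import defaultdict


def rank_genes(expr):
    """Return expressed genes (value > 0) by descending expression; ties broken by name."""
    ordered = sorted(expr.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gene for gene, value in ordered if value > 0]


def rg_sentence(expr):
    """Ranked-gene (RG) sentence: one global ranking across all expressed genes."""
    return " ".join(rank_genes(expr))


def gf_sentence(expr, gene_to_family, unknown="UNASSIGNED"):
    """Gene-family (GF) sentence: group genes by family, rank within each family.

    Assumptions (not stated in the abstract): families are emitted in
    alphabetical order, and genes without a family go to an `unknown` bucket.
    """
    families = defaultdict(list)
    # rank_genes() yields genes in global rank order, so appending here
    # preserves descending expression inside every family.
    for gene in rank_genes(expr):
        families[gene_to_family.get(gene, unknown)].append(gene)
    return " ".join(gene for fam in sorted(families) for gene in families[fam])


# Toy cell and hypothetical family mapping, for illustration only.
cell = {"CD3E": 5.0, "KRT8": 9.0, "CD3D": 7.0, "KRT18": 4.0, "ACTB": 0.0}
gene_to_family = {
    "CD3E": "CD_molecules",
    "CD3D": "CD_molecules",
    "KRT8": "Keratins",
    "KRT18": "Keratins",
}

print(rg_sentence(cell))                  # KRT8 CD3D CD3E KRT18
print(gf_sentence(cell, gene_to_family))  # CD3D CD3E KRT8 KRT18
```

In the RG sentence the keratin and CD genes interleave according to raw expression, whereas in the GF sentence functionally related genes sit next to each other, which is the property the abstract argues helps next-token learning.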

Authors

Haoran Ma; Chang Xu; Shamaine Wei Ting Ho; Joseph J Zhao; Yunqiang Chu; Angie Lay Keng Tan; Raghav Sundar; Patrick Tan

