Mix-Geneformer: Unified Representation Learning for Human and Mouse scRNA-seq Data
Journal: arXiv
Published Date: Jul 10, 2025
Abstract
Single-cell RNA sequencing (scRNA-seq) profiles transcriptomes at single-cell
resolution, revealing cellular heterogeneity and rare populations. Recent deep
learning models like Geneformer and Mouse-Geneformer perform well on tasks such
as cell-type classification and in silico perturbation. However, their
species-specific design limits cross-species generalization, which is crucial
for translational research and drug
discovery. We present Mix-Geneformer, a novel Transformer-based model that
integrates human and mouse scRNA-seq data into a unified representation via a
hybrid self-supervised objective that combines Masked Language Modeling (MLM)
with a SimCSE-based contrastive loss to capture both shared and species-specific gene
patterns. A rank-value encoding scheme further emphasizes high-variance gene
signals during training. Trained on about 50 million cells from diverse human
and mouse organs, Mix-Geneformer matched or outperformed state-of-the-art
baselines in cell-type classification and in silico perturbation tasks,
achieving 95.8% accuracy on mouse kidney data versus 94.9% for the best
existing model. It also successfully identified key regulatory genes validated
by in vivo studies. By enabling scalable cross-species transcriptomic modeling,
Mix-Geneformer offers a powerful tool for comparative transcriptomics and
translational applications. Despite this strong performance, limitations
remain, including computational cost and variability in zero-shot transfer.
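
The rank-value encoding mentioned above can be illustrated with a minimal sketch in the style of Geneformer. The abstract does not give the exact normalization, so the sketch assumes each gene's count is scaled by the cell's total count and by a per-gene corpus median, and that genes are then ordered by descending scaled value. The function name `rank_value_encode`, the `gene_medians` argument, and the `max_len` cutoff are hypothetical.

```python
import numpy as np

def rank_value_encode(counts, gene_medians, max_len=2048):
    """Return gene indices ordered by normalized expression (highest first).

    Assumption: normalization is count / cell total / per-gene corpus median,
    as in Geneformer-style encodings; the paper's exact scheme may differ.
    """
    counts = np.asarray(counts, dtype=float)
    gene_medians = np.asarray(gene_medians, dtype=float)
    total = counts.sum()
    if total == 0:
        return np.array([], dtype=int)
    # Scale by sequencing depth, then by how highly the gene is usually expressed.
    scaled = counts / total / gene_medians
    # Only expressed genes become tokens.
    expressed = np.flatnonzero(counts > 0)
    order = expressed[np.argsort(-scaled[expressed], kind="stable")]
    return order[:max_len]

# Toy cell with four genes; gene 0 has the highest raw count but also a
# high corpus median, so it ranks below genes 2 and 3.
counts = np.array([10, 0, 5, 5])
medians = np.array([10.0, 1.0, 1.0, 2.0])
tokens = rank_value_encode(counts, medians)
print(tokens.tolist())
```

Dividing by the corpus median pushes down genes that are highly expressed in most cells, so genes that distinguish a particular cell move toward the front of the token sequence.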
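
The hybrid objective can likewise be sketched as a weighted sum of an MLM cross-entropy term and a SimCSE-style contrastive term. This is an illustrative NumPy sketch, not the authors' implementation: the function names, the `weight` parameter, and the temperature of 0.05 are assumptions. Here `z1` and `z2` stand for two embeddings of the same cells obtained from two forward passes with different dropout masks, as in unsupervised SimCSE.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def mlm_loss(logits, targets, mask):
    """Mean cross-entropy over masked positions only.

    logits: (batch, seq, vocab), targets: (batch, seq) ints,
    mask: (batch, seq) bools marking the masked tokens.
    """
    logp = log_softmax(logits)
    picked = np.take_along_axis(logp, targets[..., None], axis=-1)[..., 0]
    return -(picked * mask).sum() / max(mask.sum(), 1)

def simcse_loss(z1, z2, temperature=0.05):
    """InfoNCE loss: row i of z1 should match row i of z2; the other
    rows in the batch act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature  # cosine similarities, scaled
    return -np.mean(np.diag(log_softmax(sim, axis=1)))

def hybrid_loss(logits, targets, mask, z1, z2, weight=1.0):
    # `weight` balances the two terms; its value here is an assumption.
    return mlm_loss(logits, targets, mask) + weight * simcse_loss(z1, z2)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6, 10))
targets = rng.integers(0, 10, size=(4, 6))
mask = rng.random((4, 6)) < 0.15
z1 = rng.normal(size=(4, 8))
z2 = z1 + 0.01 * rng.normal(size=(4, 8))  # stand-in for a second dropout pass
loss = hybrid_loss(logits, targets, mask, z1, z2)
```

In a real training loop both terms would come from the same Transformer encoder and be backpropagated together. The sketch shows only how the two losses are computed and combined.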