scReader: Prompting Large Language Models to Interpret scRNA-seq Data
Journal: arXiv
Published Date: Dec 24, 2024
Abstract
Large language models (LLMs) have demonstrated remarkable advancements,
primarily due to their capabilities in modeling the hidden relationships within
text sequences. This innovation presents a unique opportunity in the field of
life sciences, where vast collections of single-cell omics data from multiple
species provide a basis for training foundation models. However, the
challenge lies in the disparity of data scales across different species,
hindering the development of a comprehensive model for interpreting genetic
data across diverse organisms. In this study, we propose an innovative hybrid
approach that integrates the general knowledge capabilities of LLMs with
domain-specific representation models for single-cell omics data
interpretation. We begin by focusing on genes as the fundamental unit of
representation. Gene representations are initialized using functional
descriptions, leveraging the strengths of mature language models such as
LLaMA-2. By inputting single-cell gene-level expression data with prompts, we
effectively model cellular representations based on the differential expression
levels of genes across various species and cell types. In the experiments, we
constructed datasets of developmental cells from humans and mice, specifically
targeting cells that are challenging to annotate. We evaluated our methodology through
basic tasks such as cell annotation and visualization analysis. The results
demonstrate the efficacy of our approach compared to other methods using LLMs,
highlighting significant improvements in accuracy and interoperability. Our
hybrid approach enhances the representation of single-cell data and offers a
robust framework for future research in cross-species genetic analysis.
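
To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of how a cell representation could be built from gene-level representations and expression values. It assumes each gene already has a fixed embedding (in the paper these are initialized from functional descriptions using a language model such as LLaMA-2; here random vectors stand in for them). The function name `cell_embedding`, the `top_k` parameter, the log-scaled weighting, and the gene names are all hypothetical choices made for illustration.

```python
import numpy as np


def cell_embedding(expression, gene_embeddings, top_k=3):
    """Expression-weighted mean of the embeddings of the top_k expressed genes.

    Hypothetical helper: the weighting scheme is an assumption for
    illustration, not the method described in the paper.
    """
    expression = np.asarray(expression, dtype=float)
    # Indices of the top_k most highly expressed genes, highest first.
    top = np.argsort(-expression, kind="stable")[:top_k]
    # Log-scale the counts so a single dominant gene does not swamp the rest.
    weights = np.log1p(expression[top])
    total = weights.sum()
    if total == 0:
        # No expressed genes: return a zero vector rather than dividing by zero.
        return np.zeros(gene_embeddings.shape[1])
    return (weights / total) @ gene_embeddings[top]


# Stand-in gene embeddings (random placeholders for description-derived vectors).
rng = np.random.default_rng(0)
genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]  # placeholder names
gene_embeddings = rng.normal(size=(len(genes), 8))

# Expression counts for one cell, aligned with `genes`.
counts = np.array([0, 5, 1, 12, 0])
cell_vec = cell_embedding(counts, gene_embeddings)
```

Because the cell vector lives in the same space as the gene embeddings, cells from different species can in principle be compared as long as their genes are described in the same textual embedding space; how the paper combines these representations with prompts is not specified in the abstract and is not modeled here.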