Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability
Journal: arXiv
Published Date: May 12, 2025
Abstract
Understanding cell identity and function from single-cell sequencing
data remains a key challenge in computational biology. We present a novel
framework that leverages gene-specific textual annotations from the NCBI Gene
database to generate biologically contextualized cell embeddings. For each cell
in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by
expression level, retrieve their NCBI Gene descriptions, and transform these
descriptions into vector embeddings using large language models
(LLMs). The models used include OpenAI text-embedding-ada-002,
text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as
the domain-specific models BioBERT and SciBERT. Each cell embedding is computed
as an expression-weighted average of the description embeddings of the top N
most highly expressed genes in that cell, providing a compact, semantically rich
representation. This
multimodal strategy bridges structured biological data with state-of-the-art
language modeling, enabling more interpretable downstream applications such as
cell-type clustering, cell vulnerability dissection, and trajectory inference.
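
As an illustration of the expression-weighted averaging step, the following is a minimal sketch rather than the authors' implementation. It assumes the per-gene description embeddings have already been obtained from one of the language models and are stored as rows of a matrix; the function name `cell_embedding`, the toy data, and the choice to normalize the weights so they sum to 1 are assumptions made for this example, since the abstract does not specify them.

```python
import numpy as np


def cell_embedding(expression, gene_embeddings, top_n=2):
    """Expression-weighted average of gene embeddings over the top-N genes.

    expression:      shape (G,), expression of each gene in one cell
    gene_embeddings: shape (G, D), one description embedding per gene
    (Hypothetical helper; weight normalization is an assumption.)
    """
    expression = np.asarray(expression, dtype=float)
    gene_embeddings = np.asarray(gene_embeddings, dtype=float)

    # Rank genes by expression (descending) and keep at most top_n of them.
    n = min(top_n, expression.shape[0])
    top_idx = np.argsort(-expression, kind="stable")[:n]

    # Use the expression values of the selected genes as weights.
    weights = expression[top_idx]
    total = weights.sum()
    if total <= 0:
        # No expressed genes among the selection: return a zero vector.
        return np.zeros(gene_embeddings.shape[1])

    # Normalize weights to sum to 1, then take the weighted average.
    return (weights / total) @ gene_embeddings[top_idx]


# Toy data: 4 genes with 2-dimensional stand-in "embeddings".
toy_expression = [5.0, 0.0, 3.0, 2.0]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
toy_cell = cell_embedding(toy_expression, toy_embeddings, top_n=2)
```

In this toy case the two most expressed genes are the first and third, with weights 5/8 and 3/8, so the resulting cell vector is a blend of their two embeddings. In practice the embedding matrix would have one row per gene description and as many columns as the chosen model's embedding dimension.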