Simple and effective embedding model for single-cell biology built from ChatGPT.

Journal: Nature Biomedical Engineering
Published Date:

Abstract

Large-scale gene-expression data are being leveraged to pretrain models that implicitly learn gene and cellular functions. However, such models require extensive data curation and training. Here we explore a much simpler alternative: leveraging ChatGPT embeddings of genes based on the literature. We used GPT-3.5 to generate gene embeddings from text descriptions of individual genes, and then generated single-cell embeddings by averaging the gene embeddings weighted by each gene's expression level. We also created a sentence embedding for each cell by using only the gene names ordered by their expression level. On many downstream tasks used to evaluate pretrained single-cell embedding models (particularly gene-property and cell-type classification), our model, which we named GenePT, achieved comparable or better performance than models pretrained from gene-expression profiles of millions of cells. GenePT shows that large-language-model embeddings of the literature provide a simple and effective path to encoding single-cell biological knowledge.
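The cell-embedding step described above can be sketched as an expression-weighted average of per-gene vectors. This is a minimal illustration, not the authors' code: the gene vectors below are toy placeholders (in GenePT they would come from GPT-3.5 text embeddings of each gene's literature description), and the gene names and expression counts are hypothetical.

```python
import numpy as np

# Toy stand-ins for GPT-3.5 gene embeddings (hypothetical values;
# in GenePT these are embeddings of each gene's text description).
gene_embeddings = {
    "CD3D": np.array([0.9, 0.1, 0.0]),
    "CD19": np.array([0.0, 0.8, 0.2]),
    "GAPDH": np.array([0.3, 0.3, 0.4]),
}

def cell_embedding(expression, embeddings):
    """Average gene embeddings weighted by each gene's expression level."""
    genes = [g for g in expression if g in embeddings]
    weights = np.array([expression[g] for g in genes], dtype=float)
    weights = weights / weights.sum()  # normalize expression to sum to 1
    vectors = np.stack([embeddings[g] for g in genes])
    return weights @ vectors  # weighted average over genes

# Hypothetical cell with a T-cell-like expression profile (toy counts).
expr = {"CD3D": 10.0, "CD19": 0.0, "GAPDH": 5.0}
emb = cell_embedding(expr, gene_embeddings)
print(emb)
```

The alternative sentence-embedding variant mentioned in the abstract would instead concatenate the gene names sorted by descending expression into a single string and embed that text directly.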

Authors

  • Yiqun Chen
    Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
  • James Zou
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.