Simple and effective embedding model for single-cell biology built from ChatGPT.
Journal:
Nature Biomedical Engineering
Published Date:
Dec 6, 2024
Abstract
Large-scale gene-expression data are being leveraged to pretrain models that implicitly learn gene and cellular functions. However, such models require extensive data curation and training. Here we explore a much simpler alternative: leveraging ChatGPT embeddings of genes based on the literature. We used GPT-3.5 to generate gene embeddings from text descriptions of individual genes, and then generated single-cell embeddings by averaging the gene embeddings weighted by each gene's expression level. We also created a sentence embedding for each cell by using only the gene names ordered by their expression level. On many downstream tasks used to evaluate pretrained single-cell embedding models, particularly gene-property and cell-type classification, our model, which we named GenePT, achieved performance comparable to or better than that of models pretrained on gene-expression profiles of millions of cells. GenePT shows that large-language-model embeddings of the literature provide a simple and effective path to encoding single-cell biological knowledge.
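The two cell-level representations described in the abstract (an expression-weighted average of gene embeddings, and a "sentence" of gene names ordered by expression) are simple to reproduce. Below is a minimal Python sketch, assuming the per-gene embeddings have already been obtained from a text-embedding model applied to gene descriptions; the function names, the `gene_embeddings` dictionary, and the example values are illustrative, not the authors' released code.

```python
import numpy as np

def cell_embedding_weighted(expression, gene_embeddings):
    """Expression-weighted average of precomputed gene embeddings for one cell.

    expression      : dict mapping gene name -> expression level (e.g. normalized counts)
    gene_embeddings : dict mapping gene name -> 1-D numpy array (text embedding of the gene)
    """
    genes = [g for g, x in expression.items() if x > 0 and g in gene_embeddings]
    weights = np.array([expression[g] for g in genes], dtype=float)
    vectors = np.stack([gene_embeddings[g] for g in genes])
    weights /= weights.sum()          # normalize weights so they sum to 1
    return weights @ vectors          # weighted mean over gene embedding vectors

def cell_sentence(expression, top_k=None):
    """Build a 'sentence' of gene names ordered by decreasing expression.

    The resulting string can itself be passed to a text-embedding model
    to obtain a sentence-level embedding for the cell.
    """
    ranked = sorted((g for g, x in expression.items() if x > 0),
                    key=lambda g: expression[g], reverse=True)
    return " ".join(ranked[:top_k] if top_k else ranked)

# Illustrative usage with made-up values (embedding dimension and genes are hypothetical)
gene_embeddings = {"CD3D": np.random.rand(1536), "MS4A1": np.random.rand(1536)}
expression = {"CD3D": 5.2, "MS4A1": 0.7}
cell_vec = cell_embedding_weighted(expression, gene_embeddings)
cell_text = cell_sentence(expression)   # "CD3D MS4A1"
```

In this sketch the weighted average keeps the embedding dimensionality of the gene vectors, while the ordered-gene-name sentence defers the embedding step to whatever text-embedding model is used downstream.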