A natural language processing system for the efficient extraction of cell markers.

Journal: Scientific Reports
PMID:

Abstract

Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for exploring cellular landscapes across diverse species and tissues. Precise annotation of cell types is essential for understanding these landscapes, and it relies heavily on empirical knowledge and curated cell marker databases. In this study, we introduce MarkerGeneBERT, a natural language processing (NLP) system designed to extract information on species, tissues, cell types, and cell marker genes from the single-cell sequencing literature. Using MarkerGeneBERT, we systematically parsed the full text of 3702 single-cell sequencing-related studies, yielding 7901 cell markers representing 1606 cell types across 425 human tissues/subtissues, and 8223 cell markers representing 1674 cell types across 482 mouse tissues/subtissues. Comparison against manually curated databases showed that our approach achieved 76% completeness and 75% accuracy, and it also identified 89 cell types and 183 marker genes absent from existing databases. Furthermore, we applied the brain tissue marker gene list compiled by MarkerGeneBERT to annotate scRNA-seq data, and the results were consistent with those of the original studies. Our findings show that NLP-based methods can expedite and augment the annotation and interpretation of scRNA-seq data. The 27323 manually reviewed sentences used to train MarkerGeneBERT and the source code are hosted at https://github.com/chengpeng1116/MarkerGeneBERT.
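
The abstract does not define how completeness and accuracy were computed. One plausible reading, assumed here, is that completeness is the fraction of curated (cell type, marker) pairs that the extraction recovers (recall), and accuracy is the fraction of extracted pairs that also appear in the curated database (precision). The sketch below illustrates that reading only; the function name and the toy pairs are hypothetical and are not taken from the paper or its repository.

```python
def compare_to_curated(extracted, curated):
    """Compare extracted (cell_type, marker) pairs against a curated set.

    Assumed definitions (not confirmed by the source):
      completeness = |extracted AND curated| / |curated|    (recall)
      accuracy     = |extracted AND curated| / |extracted|  (precision)
    Pairs found only in the extraction are returned as candidate novel entries.
    """
    extracted = set(extracted)
    curated = set(curated)
    overlap = extracted & curated
    completeness = len(overlap) / len(curated) if curated else 0.0
    accuracy = len(overlap) / len(extracted) if extracted else 0.0
    novel = extracted - curated
    return completeness, accuracy, novel


# Toy, illustrative data only; these pairs are not results from the study.
curated_pairs = {
    ("T cell", "CD3E"),
    ("B cell", "CD79A"),
    ("Astrocyte", "GFAP"),
    ("Microglia", "CX3CR1"),
}
extracted_pairs = {
    ("T cell", "CD3E"),
    ("B cell", "CD79A"),
    ("Astrocyte", "GFAP"),
    ("Neuron", "RBFOX3"),
}

completeness, accuracy, novel = compare_to_curated(extracted_pairs, curated_pairs)
print(f"completeness={completeness:.2f} accuracy={accuracy:.2f} novel={sorted(novel)}")
```

Under this reading, a pair that is extracted but missing from the curated set lowers accuracy even when it is biologically valid, which is why such pairs are reported separately as candidate novel entries rather than discarded.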

Authors

  • Peng Cheng
    University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA.
  • Yan Peng
    Key Acu-moxibustion Laboratory of Biological Information Analysis of Institute of Acupuncture, Moxibustion and Massage, Hunan University of Chinese Medicine, Changsha 410007, China.
  • Xiao-Ling Zhang
    Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
  • Sheng Chen
    Department of Thoracic Surgery, The Affiliated Huaian No.1 People's Hospital of Nanjing Medical University, Huai'an, China.
  • Bin-Bin Fang
    Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China.
  • Yan-Ze Li
    Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China. yanzeli@capitalbiotech.com.
  • Yi-Min Sun
    Marketing and Management Department, CapitalBio Technology, Beijing, 100176, China. yiminsun_pub@capitalbiotech.com.