DUBStepR is a scalable correlation-based feature selection method for accurately clustering single-cell data.

Journal: Nature Communications
PMID:

Abstract

Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single-cell clustering pipelines. Existing feature selection methods perform inconsistently across datasets, occasionally even resulting in poorer clustering accuracy than without feature selection. Moreover, existing methods ignore information contained in gene-gene correlations. Here, we introduce DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUBStepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. Additionally, DUBStepR was the only method to robustly deconvolve T and NK heterogeneity by identifying disease-associated common and rare cell types and subtypes in PBMCs from rheumatoid arthritis patients. DUBStepR is scalable to over a million cells, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.
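
To make the idea of using gene-gene correlations for feature selection concrete, the following is a minimal Python sketch. It is not the DUBStepR algorithm: it implements neither the stepwise regression nor the Density Index named in the abstract. It only illustrates, under simplifying assumptions, how a correlation matrix can be used to rank genes. The function names (`correlation_feature_scores`, `select_features`), the `top_k` parameter, and the scoring rule (mean of each gene's strongest absolute correlations) are hypothetical choices made for this illustration.

```python
import numpy as np


def correlation_feature_scores(expr, top_k=3):
    """Score each gene by its strongest correlations with other genes.

    expr  : array of shape (n_cells, n_genes), e.g. log-normalised expression.
    top_k : number of strongest partner genes averaged per gene
            (hypothetical parameter chosen for this sketch).
    """
    # Gene-gene Pearson correlation matrix (genes are columns).
    corr = np.corrcoef(expr, rowvar=False)
    # Constant genes yield NaN correlations; treat them as uncorrelated.
    corr = np.nan_to_num(corr)
    # Ignore each gene's trivial correlation with itself.
    np.fill_diagonal(corr, 0.0)
    abs_corr = np.abs(corr)
    k = max(1, min(top_k, abs_corr.shape[1] - 1))
    # Mean of the k largest absolute correlations in each row.
    strongest = np.sort(abs_corr, axis=1)[:, -k:]
    return strongest.mean(axis=1)


def select_features(expr, n_features, top_k=3):
    """Return indices of the n_features highest-scoring genes."""
    scores = correlation_feature_scores(expr, top_k=top_k)
    return np.argsort(scores)[::-1][:n_features]
```

The intuition behind this toy scoring rule is that genes marking a cell population tend to be co-expressed with other markers of the same population, whereas genes dominated by noise correlate only weakly with everything else. A call such as `select_features(expr, n_features=200)` would return the column indices of the top-ranked genes, which could then be passed to a downstream clustering step.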

Authors

  • Bobby Ranjan
    Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore.
  • Wenjie Sun
  • Jinyu Park
    Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore.
  • Kunal Mishra
    Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore.
  • Florian Schmidt
    Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany.
  • Ronald Xie
    Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore.
  • Fatemeh Alipour
    School of Computer Science, University of Waterloo, Waterloo, ON, Canada.
  • Vipul Singhal
    Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore.
  • Ignasius Joanito
    Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore.
  • Mohammad Amin Honardoost
    Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore.
  • Jacy Mei Yun Yong
    Department of Rheumatology, Allergy and Immunology, Tan Tock Seng Hospital, Singapore, 308433, Singapore.
  • Ee Tzun Koh
    Department of Rheumatology, Allergy and Immunology, Tan Tock Seng Hospital, Singapore, 308433, Singapore.
  • Khai Pang Leong
    Department of Rheumatology, Allergy and Immunology, Tan Tock Seng Hospital, Singapore, 308433, Singapore.
  • Nirmala Arul Rayan
    Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore.
  • Michelle Gek Liang Lim
    Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672, Singapore.
  • Shyam Prabhakar
    Computational and Systems Biology, Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Genome, #02-01, Singapore, 138672, Singapore.