flowSim: Near duplicate detection for flow cytometry data.

Journal: Cytometry. Part A : the journal of the International Society for Analytical Cytology
Published Date:

Abstract

The analysis of large amounts of data is important for the development of machine learning (ML) models. flowSim is the first algorithm designed to visualize, detect, and remove highly redundant information in flow cytometry (FCM) training sets, with the aim of decreasing the computational time for training and increasing the performance of ML algorithms by reducing overfitting. flowSim performs near-duplicate image detection by combining community detection algorithms with density analysis of the marker expression values. On a dataset of 160 images of bivariate FCM data, flowSim clustering achieved a mean Adjusted Rand Index of 0.90 against consensus manual clustering, demonstrating its efficiency in identifying similar patterns. flowSim selectively discarded near-duplicate files in datasets constructed with known redundancy, and removed 92.6% of FCM images in a dataset of over 500,000 drawn from public repositories.
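
The Adjusted Rand Index (ARI) reported above measures agreement between two partitions of the same items, corrected for chance: 1.0 means identical groupings, and values near 0 mean agreement no better than random. The abstract does not say how the authors computed it, so the following is only a minimal sketch of the standard pair-counting formula. The function name and the example labels are hypothetical and are not taken from flowSim.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standard pair-counting ARI between two labelings of the same items.

    Illustrative only: not flowSim's own code. Cluster label values are
    arbitrary; only the grouping they induce matters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs of items placed together in both partitions (contingency cells).
    together_in_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # chance-level agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in one cluster
    return (together_in_both - expected) / (maximum - expected)


# Hypothetical labels: a manual grouping of six images vs. an automated one.
manual = ["a", "a", "a", "b", "b", "c"]
automated = [1, 1, 2, 2, 2, 3]
print(adjusted_rand_index(manual, automated))
```

A mean ARI, as in the abstract, would then be the average of such scores over the comparisons performed; which comparisons were averaged is not specified here.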

Authors

  • Sebastiano Montante
    Terry Fox Laboratory, BC Cancer Research, Vancouver, British Columbia, Canada.
  • Yixuan Chen
    Terry Fox Laboratory, BC Cancer Research, Vancouver, British Columbia, Canada.
  • Ryan R Brinkman
    Molecular Biology and Biochemistry Department, Simon Fraser University, Burnaby, BC V5A 1S6, Canada; Terry Fox Laboratory, British Columbia Cancer Agency, Vancouver, BC V5Z 1L3, Canada; Department of Neurology, University at Buffalo School of Medicine and Biomedical Sciences, Buffalo, NY 14203, USA; Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA; Institute for Immunity, Transplantation and Infection, Stanford University School of Medicine, Stanford, CA 94305, USA; National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA; Center for Human Immunology, Autoimmunity and Inflammation, National Institutes of Health, Bethesda, MD 20892, USA; School of Dental Medicine, University at Buffalo, NY 14214-8006, USA; J. Craig Venter Institute, La Jolla, CA 92037, USA; Department of Pathology, University of California, San Diego, CA 92093, USA.