Differential privacy-based evaporative cooling feature selection and classification with relief-F and random forests.

Journal: Bioinformatics (Oxford, England)
Published Date:

Abstract

MOTIVATION: Classification of individuals into disease or clinical categories from high-dimensional biological data with low prediction error is an important challenge of statistical learning in bioinformatics. Feature selection can improve classification accuracy but must be incorporated carefully into cross-validation to avoid overfitting. Recently, feature selection methods based on differential privacy, such as differentially private random forests and reusable holdout sets, have been proposed. However, for domains such as bioinformatics, where the number of features is much larger than the number of observations (p ≫ n), these differential privacy methods are susceptible to overfitting.
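
The point about incorporating feature selection into cross-validation can be illustrated with a minimal sketch. This is not the evaporative cooling method the title refers to, nor any differential privacy mechanism; it is only a toy demonstration, under stated assumptions, of why selection must happen inside each training fold when p ≫ n. The data are pure noise, and the helper names (select_top_k, centroid, cv_accuracy), the mean-difference filter, and the nearest-centroid classifier are all hypothetical choices made for illustration.

```python
import random

def select_top_k(X, y, k):
    """Rank features by absolute difference in class means; return top-k indices."""
    p = len(X[0])
    scores = []
    for j in range(p):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(p), key=lambda j: -scores[j])[:k]

def centroid(X, y, label, feats):
    """Mean vector of one class, restricted to the selected features."""
    rows = [row for row, lab in zip(X, y) if lab == label]
    return [sum(r[j] for r in rows) / len(rows) for j in feats]

def cv_accuracy(X, y, k, n_folds, select_inside):
    """k-fold CV accuracy of a nearest-centroid classifier.

    select_inside=False: features are chosen once from ALL samples,
    so held-out labels leak into the selection step.
    select_inside=True: features are re-chosen from each training fold only.
    """
    n = len(X)
    leaky_feats = select_top_k(X, y, k)
    correct = 0
    for f in range(n_folds):
        train = [i for i in range(n) if i % n_folds != f]
        test = [i for i in range(n) if i % n_folds == f]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        feats = select_top_k(X_tr, y_tr, k) if select_inside else leaky_feats
        c0 = centroid(X_tr, y_tr, 0, feats)
        c1 = centroid(X_tr, y_tr, 1, feats)
        for i in test:
            x = [X[i][j] for j in feats]
            d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
            d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
            pred = 1 if d1 < d0 else 0
            correct += int(pred == y[i])
    return correct / n

# Pure-noise data with p >> n: no feature carries real signal.
rng = random.Random(0)
n, p = 40, 500
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

biased = cv_accuracy(X, y, k=10, n_folds=5, select_inside=False)
honest = cv_accuracy(X, y, k=10, n_folds=5, select_inside=True)
print("selection outside CV:", biased)
print("selection inside CV: ", honest)
```

Because the labels are unrelated to the features, an unbiased estimate should sit near chance. Selecting features once on the full data set tends to report an optimistic accuracy, since the held-out samples already influenced which features were kept; re-selecting within each training fold removes that leak.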

Authors

  • Trang T Le
    Department of Biostatistics, Epidemiology, and Informatics.
  • W Kyle Simmons
    Laureate Institute for Brain Research, Tulsa, OK 74136, USA.
  • Masaya Misaki
    Laureate Institute for Brain Research, Tulsa, OK 74136, USA.
  • Jerzy Bodurka
    Laureate Institute for Brain Research, Tulsa, OK 74136, USA.
  • Bill C White
    Tandy School of Computer Science, University of Tulsa, OK 74104, USA.
  • Jonathan Savitz
    Laureate Institute for Brain Research, Tulsa, OK 74136, USA.
  • Brett A McKinney
    Department of Mathematics, University of Tulsa, Tulsa, OK 74104, USA.