A comparison of random forest variable selection methods for regression modeling of continuous outcomes.

Journal: Briefings in bioinformatics
PMID:

Abstract

Random forest (RF) regression is a popular machine learning method for developing prediction models for continuous outcomes. Variable selection, also known as feature selection or reduction, involves selecting a subset of predictor variables for modeling. Potential benefits of variable selection are methodologic (i.e. improving prediction accuracy and computational efficiency) and practical (i.e. reducing the burden of data collection and improving efficiency). Several variable selection methods leveraging RFs have been proposed, but there is limited evidence to guide decisions on which methods may be preferable for different types of datasets with continuous outcomes. Using 59 publicly available datasets in a benchmarking study, we evaluated the implementation of 13 RF variable selection methods. Performance of variable selection was measured via the out-of-sample R² of an RF that used the variables selected by each method. Simplicity of variable selection was measured via the percent reduction in the number of variables, comparing the number selected with the number available. Efficiency was measured via the computational time required to complete the variable selection. Based on our benchmarking study, variable selection methods implemented in the Boruta and aorsf R packages selected the best subset of variables for axis-based RF models, whereas methods implemented in the aorsf R package selected the best subset of variables for oblique RF models. A significant contribution of this study is the ability to assess different variable selection methods in the setting of RF regression for continuous outcomes to identify preferable methods using an open science approach.
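
The three evaluation criteria named in the abstract (out-of-sample R², percent reduction in variables, and computational time) can be illustrated with a minimal sketch. This is not the authors' benchmarking code. The function names are hypothetical. The R² formula is the common definition using the mean of the held-out outcomes, and the percent reduction is read here as the share of available variables that were dropped. Both are assumptions about details the abstract does not state.

```python
import time


def out_of_sample_r2(y_true, y_pred):
    """R^2 on held-out data: 1 - SS_residual / SS_total.

    Assumption: SS_total uses the mean of the held-out outcomes.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def percent_reduction(n_selected, n_available):
    """Percent of the available variables that were not selected (assumed reading)."""
    return 100.0 * (n_available - n_selected) / n_available


def timed(select_fn, *args):
    """Run a variable selection function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = select_fn(*args)
    return result, time.perf_counter() - start
```

For example, under these assumptions a method that keeps 5 of 20 available variables has `percent_reduction(5, 20)` equal to 75.0, and a model whose held-out predictions all equal the held-out mean has an out-of-sample R² of 0.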

Authors

  • Nathaniel S O'Connell
    Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Medical Center Boulevard, Winston-Salem, NC 27157, United States.
  • Byron C Jaeger
    Kirklin Institute for Research in Surgical Outcomes, University of Alabama at Birmingham.
  • Garrett S Bullock
    Department of Orthopaedic Surgery, Wake Forest School of Medicine, Winston-Salem, NC, USA; Centre for Sport, Exercise and Osteoarthritis Research Versus Arthritis, University of Oxford, Oxford, United Kingdom.
  • Jaime Lynn Speiser
    Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA.