fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies.

Journal: BMC bioinformatics
Published Date:

Abstract

BACKGROUND: Parametric feature selection methods for machine learning and association studies based on genetic data are not robust with respect to outliers or influential observations. While rank-based, distribution-free statistics offer a robust alternative to parametric methods, their practical utility can be limited, as they demand significant computational resources when analyzing high-dimensional data. For genetic studies that seek to identify variants, the hypothesis is constrained, since it is typically assumed that the effect of the genotype on the phenotype is monotone (e.g., an additive genetic effect). Similarly, predictors for machine learning applications may have natural ordering constraints. Cross-validation for feature selection in these high-dimensional contexts necessitates highly efficient computational algorithms for the robust evaluation of many features.
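ILLUSTRATION: The kind of rank-based statistic for ordered alternatives described above can be sketched with a Jonckheere-Terpstra-type count, which is the test the package name refers to. The following is a minimal sketch, not the package's implementation: the function name `jonckheere_terpstra`, the phenotype values, and the grouping by genotype are all hypothetical and chosen only for illustration. It uses a naive pairwise loop, whose cost grows quadratically with sample size for every feature tested, which is the computational burden the abstract refers to.

```python
from itertools import combinations


def jonckheere_terpstra(groups):
    """Naive Jonckheere-Terpstra-type statistic (illustrative sketch only).

    `groups` is a list of samples given in the hypothesized order
    (e.g. genotype 0, 1, 2). For every pair of groups (i < j), count the
    pairs (x, y), x from group i and y from group j, with x < y.
    Tied pairs contribute 1/2.
    """
    stat = 0.0
    # combinations() keeps the input order, so `lower` always precedes `higher`.
    for lower, higher in combinations(groups, 2):
        for x in lower:
            for y in higher:
                if x < y:
                    stat += 1.0
                elif x == y:
                    stat += 0.5
    return stat


# Hypothetical phenotype values grouped by genotype (0, 1, 2 minor alleles).
with_outlier = [
    [1.2, 0.8, 1.0],   # genotype 0
    [1.5, 1.1, 1.9],   # genotype 1
    [2.4, 2.0, 50.0],  # genotype 2; 50.0 is an extreme observation
]
# Same data with the extreme value pulled back into the typical range.
without_outlier = [
    [1.2, 0.8, 1.0],
    [1.5, 1.1, 1.9],
    [2.4, 2.0, 2.5],
]

jt_with = jonckheere_terpstra(with_outlier)
jt_without = jonckheere_terpstra(without_outlier)
print(jt_with, jt_without)
```

Because the statistic depends only on the ordering of the observations, replacing the extreme value 50.0 with 2.5 leaves it unchanged, whereas a parametric summary such as a regression slope would shift substantially.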

Authors

  • Jiaxing Lin
    Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA.
  • Alexander Sibley
    Duke Cancer Institute, Duke University Medical Center, Durham, NC, USA.
  • Ivo Shterev
    Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA.
  • Andrew Nixon
    Duke Cancer Institute, Duke University Medical Center, Durham, NC, USA.
  • Federico Innocenti
    Division of Pharmacotherapy and Experimental Therapeutics, Chapel Hill, NC, USA.
  • Cliburn Chan
    Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA.
  • Kouros Owzar
    Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA. Kouros.Owzar@duke.edu.