The feature selection bias problem in relation to high-dimensional gene data.

Journal: Artificial intelligence in medicine
Published Date:

Abstract

OBJECTIVE: Feature selection is a technique widely used in data mining. The aim is to select the best subset of features relevant to the problem being considered. In this paper, we consider feature selection for the classification of gene datasets. Gene data usually consist of just a few dozen objects described by thousands of features. For this kind of data, it is easy to find a model that fits the learning data. However, it is much harder to find one that performs as well on new data as it does on the learning data. This overfitting issue is well known in classification and regression, but it also applies to feature selection.
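
The overfitting of feature selection described above can be illustrated with a minimal sketch. It is not the authors' method: the synthetic noise data, the mean-difference filter, the nearest-centroid classifier, and all function names (make_noise_data, select_features, loo_accuracy) are assumptions chosen only for illustration. The labels carry no real signal, so an honest estimate should stay near chance. When features are selected once on all objects before leave-one-out validation, the estimate is typically far too optimistic; repeating the selection inside each fold removes that bias.

```python
import random


def make_noise_data(n_objects=40, n_features=1000, seed=0):
    """Pure-noise data: a few dozen objects, many features, arbitrary labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_objects)]
    y = [i % 2 for i in range(n_objects)]  # balanced labels unrelated to X
    return X, y


def class_means(X, y, rows, j):
    """Mean of feature j in each of the two classes, using only `rows`."""
    sums, counts = [0.0, 0.0], [0, 0]
    for i in rows:
        sums[y[i]] += X[i][j]
        counts[y[i]] += 1
    return sums[0] / counts[0], sums[1] / counts[1]


def select_features(X, y, rows, k):
    """Keep the k features with the largest absolute class-mean difference."""
    def score(j):
        m0, m1 = class_means(X, y, rows, j)
        return abs(m0 - m1)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def loo_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    select_inside_fold=False selects features once on ALL objects (biased).
    select_inside_fold=True repeats the selection on each training fold.
    """
    all_rows = list(range(len(X)))
    fixed = None if select_inside_fold else select_features(X, y, all_rows, k)
    correct = 0
    for test in all_rows:
        train = [i for i in all_rows if i != test]
        feats = select_features(X, y, train, k) if select_inside_fold else fixed
        dist = [0.0, 0.0]  # squared distance to each class centroid
        for j in feats:
            m0, m1 = class_means(X, y, train, j)
            dist[0] += (X[test][j] - m0) ** 2
            dist[1] += (X[test][j] - m1) ** 2
        predicted = 0 if dist[0] <= dist[1] else 1
        correct += int(predicted == y[test])
    return correct / len(X)


if __name__ == "__main__":
    X, y = make_noise_data()
    print("selection outside validation:", loo_accuracy(X, y, 10, False))
    print("selection inside each fold:  ", loo_accuracy(X, y, 10, True))
```

Because the data are noise, any gap between the two printed accuracies reflects selection bias only, not a difference in model quality.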

Authors

  • Jerzy Krawczuk
    Faculty of Computer Science, Bialystok University of Technology, 45A Wiejska St., 15-351 Bialystok, Poland.
  • Tomasz Łukaszuk
    Faculty of Computer Science, Bialystok University of Technology, 45A Wiejska St., 15-351 Bialystok, Poland. Electronic address: t.lukaszuk@pb.edu.pl.