A novel two-stage feature selection method based on random forest and improved genetic algorithm for enhancing classification in machine learning.

Journal: Scientific Reports
Published Date:

Abstract

Data acquisition methods are becoming increasingly diverse and advanced, which leads to higher-dimensional data, blurred classification boundaries, and a greater risk of overfitting, all of which reduce the accuracy of machine learning models. Many studies have sought to improve model performance through feature selection. However, any single feature selection method tends to be incomplete, unstable, or time-consuming. Combining the strengths of several feature selection methods can help overcome these shortcomings. This paper proposes a two-stage feature selection method based on random forest and an improved genetic algorithm. First, random forest importance scores are computed and ranked, and features are eliminated in a preliminary pass according to their scores, which reduces the time complexity of the subsequent stage. Then, the improved genetic algorithm searches the remaining features for the globally optimal feature subset. This stage uses a multi-objective fitness function to guide the search, minimizing the number of features in the subset while enhancing classification accuracy. The paper also adds an adaptive mechanism and an evolution strategy to counter the loss of population diversity and the degeneration that occur in the later stages of iteration, thereby improving search efficiency. Experimental results on eight UCI datasets show that the proposed method significantly improves classification performance and has excellent feature selection capability.
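The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the importance scores are assumed to be precomputed (for instance by a random forest), the fitness is a simple weighted sum with a hypothetical weight `alpha`, and the genetic search uses plain elitism, single-point crossover, and bit-flip mutation rather than the paper's adaptive mechanism and evolution strategy. All function names are illustrative.

```python
import random
from typing import Callable, List, Sequence


def prefilter(importances: Sequence[float], keep: int) -> List[int]:
    """Stage 1: return the indices of the `keep` highest-scoring features."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:keep])


def fitness(mask: Sequence[int],
            accuracy: Callable[[Sequence[int]], float],
            alpha: float = 0.9) -> float:
    """Weighted-sum fitness: reward accuracy, penalise subset size.

    `alpha` is an assumed trade-off weight, not a value from the paper.
    """
    n_selected = sum(mask)
    if n_selected == 0:  # an empty subset cannot classify anything
        return 0.0
    return alpha * accuracy(mask) + (1 - alpha) * (1 - n_selected / len(mask))


def genetic_search(n_features: int,
                   accuracy: Callable[[Sequence[int]], float],
                   pop_size: int = 20,
                   generations: int = 30,
                   p_mut: float = 0.05,
                   seed: int = 0) -> List[int]:
    """Stage 2: search binary feature masks with a basic genetic algorithm.

    Requires n_features >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]

    def score(mask: Sequence[int]) -> float:
        return fitness(mask, accuracy)

    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        elite = pop[: pop_size // 2]            # keep the better half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)         # pick two distinct parents
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = a[:cut] + b[cut:]
            # bit-flip mutation with probability p_mut per gene
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=score)


if __name__ == "__main__":
    # Toy data: hypothetical importance scores for six features.
    scores = [0.30, 0.02, 0.25, 0.01, 0.20, 0.22]
    kept = prefilter(scores, keep=4)

    # Stand-in for a cross-validated classifier: the mean importance of the
    # selected features, used here only so the example runs without data.
    def toy_accuracy(mask: Sequence[int]) -> float:
        chosen = [scores[kept[i]] for i, bit in enumerate(mask) if bit]
        return sum(chosen) / len(chosen)

    best = genetic_search(len(kept), toy_accuracy)
    print("kept after stage 1:", kept)
    print("selected in stage 2:", [kept[i] for i, bit in enumerate(best) if bit])
```

In practice the `accuracy` callable would train and validate a classifier on the columns selected by the mask, which is the expensive step. That cost is why the preliminary elimination in stage 1 matters: it shrinks the search space before the genetic algorithm starts.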

Authors

  • Junyao Ding
    School of Telecommunications Engineering, Xidian University, Xi'an, China.
  • Jianchao Du
    School of Telecommunications Engineering, Xidian University, Xi'an, China.
  • Hejie Wang
    School of Telecommunications Engineering, Xidian University, Xi'an, 710071, China.
  • Song Xiao
