Positive region preserved random sampling: an efficient feature selection method for massive data
Journal:
arXiv
Published Date:
Jul 1, 2025
Abstract
Selecting relevant features is an important and necessary step for
intelligent machines to maximize their chances of success. However, intelligent
machines generally have no enough computing resources when faced with huge
volume of data. This paper develops a new method based on sampling techniques
and rough set theory to address the challenge of feature selection for massive
data. To this end, this paper proposes using the ratio of discernible object
pairs to all object pairs that should be distinguished to measure the
discriminatory ability of a feature set. Based on this measure, a new feature
selection method is proposed. This method constructs positive region preserved
samples from massive data to find a feature subset with high discriminatory
ability. Compared with other methods, the proposed method has two advantages.
First, it is able to select a feature subset that can preserve the
discriminatory ability of all the features of the target massive data set
within an acceptable time on a personal computer. Second, the lower boundary of
the probability of the object pairs that can be discerned using the feature
subset selected in all object pairs that should be distinguished can be
estimated before finding reducts. Furthermore, 11 data sets of different sizes
were used to validate the proposed method. The results show that approximate
reducts can be found in a very short period of time, and the discriminatory
ability of the final reduct is larger than the estimated lower boundary.
Experiments on four large-scale data sets also showed that an approximate
reduct with high discriminatory ability can be obtained in reasonable time on a
personal computer.