EODA: A three-stage efficient outlier detection approach using Boruta-RF feature selection and enhanced KNN-based clustering algorithm.

Journal: PLOS ONE
Published Date:

Abstract

Outlier detection is essential for identifying unusual patterns or observations that deviate significantly from the normal behavior of a dataset. With the rapid growth of data science, anomalies and outliers have become more prevalent; they can disrupt system modeling and parameter estimation and lead to inaccurate results. Recently, deep learning-based outlier detection methods have gained significant attention, but their performance is often limited by challenges in parameter selection and nearest neighbor search. To overcome these limitations, we propose a three-stage Efficient Outlier Detection Approach (EODA) that not only detects outliers with high accuracy but also takes dataset characteristics into account. In the first stage, we apply a feature selection algorithm based on the Boruta method and Random Forest, which reduces the data size by selecting the most relevant attributes and calculating the highest Z-score of the shadow features. In the second stage, we improve the K-nearest neighbors algorithm to enhance the accuracy of nearest neighbor identification in the clustering phase. Finally, the third stage efficiently identifies the most significant outliers within the clustered datasets. We evaluate the proposed EODA algorithm on eight datasets from the UCI Machine Learning Repository. The results demonstrate the effectiveness of EODA, which achieves a Precision of 63.07%, a Recall of 82.49%, and an F1-Score of 64.53%, outperforming existing techniques in the field.
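
The abstract does not spell out the enhanced KNN procedure, so the following is only a minimal sketch of the general idea behind nearest-neighbor-based outlier scoring, not the authors' EODA algorithm. It assumes a plain Euclidean distance and scores each point by its mean distance to its k nearest neighbors; the function names `knn_outlier_scores` and `top_outliers`, the toy data, and the parameter values are all hypothetical.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by its mean Euclidean distance to its k nearest neighbours.

    Generic k-NN distance score (an assumption for illustration),
    not the enhanced KNN variant proposed in the paper.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def top_outliers(points, k, n_outliers):
    """Return the indices of the n_outliers points with the largest scores."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_outliers]


# Hypothetical toy data: four tightly grouped points and one distant point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(data, k=2, n_outliers=1))  # prints [4]
```

A larger score means a point lies farther from its neighborhood. In a pipeline like the one described, such scoring would run on the reduced feature set produced by the feature selection stage; that ordering is taken from the abstract, while the scoring rule itself is an assumption.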

Authors

  • Sunil Kumar
    School of Computer Science, University of Petroleum and Energy Studies, Dehradun, India.
  • Sudeep Varshney
    Department of Computer Science & Engineering, School of Engineering & Technology, Sharda University, Greater Noida, India.
  • Usha Jain
    Department of Computer Science & Engineering, Manipal University Jaipur, Jaipur, India.
  • Prashant Johri
    SCSE, Galgotias University, Greater Noida, Noida, 203201, UP, India.
  • Abdulaziz S Almazyad
    Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, P.O. Box 51178, Riyadh 11543, Saudi Arabia.
  • Ali Wagdy Mohamed
    Operations Research Department, Faculty of Graduate Studies for Statistical Research, Cairo University, Giza 12613, Egypt.
  • Mehdi Hosseinzadeh
    School of Computer Science, Duy Tan University, Da Nang, 550000, Viet Nam; Jadara Research Center, Jadara University, Irbid 21110, Jordan. Electronic address: mehdihosseinzadeh@duytan.edu.vn.
  • Mohammad Shokouhifar
    DTU AI and Data Science Hub (DAIDASH), Duy Tan University, Da Nang, 550000, Viet Nam.