Improving internet of vehicles research: A systematic preprocessing framework for the VeReMi dataset.

Journal: Data in Brief

Abstract

The Vehicular Reference Misbehavior Dataset (VeReMi) is a vital resource for advancing Intelligent Transportation Systems (ITS) and the Internet of Vehicles (IoV). However, its large size (∼7 GB) and inherent class imbalance pose significant challenges for machine learning model development. This paper presents a preprocessing framework to enhance VeReMi's usability and relevance. Through 10% down-sampling, the dataset was reduced to ∼724 MB, making it computationally manageable. Biases were addressed by balancing benign and malicious samples through synthesis and by identifying benign instances using predefined criteria. A refined feature set of key attributes, one of which was renamed, was selected to improve machine learning compatibility. The preprocessing pipeline maintains data integrity and preserves the representativeness of malicious patterns. The optimized dataset is well suited for ITS and IoV applications such as anomaly detection and network security, underscoring the role of preprocessing in overcoming real-world constraints and enhancing model performance.
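
The two core steps named in the abstract, 10% down-sampling and class balancing, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout, the `label` field (0 = benign, 1 = malicious), and the function names `downsample` and `balance` are hypothetical, and plain random oversampling with replacement stands in for the sample synthesis the abstract mentions, whose exact method is not given here.

```python
import random
from collections import defaultdict


def downsample(records, fraction=0.10, label_key="label", seed=0):
    """Keep roughly `fraction` of the records of each class (stratified sampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[label_key]].append(rec)
    sample = []
    for recs in by_class.values():
        # Keep at least one record per class so no class disappears.
        k = max(1, round(len(recs) * fraction))
        sample.extend(rng.sample(recs, k))
    return sample


def balance(records, label_key="label", seed=0):
    """Oversample smaller classes with replacement up to the largest class size.

    Assumption: this is a simple stand-in for the synthesis step, not the
    method used in the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[label_key]].append(rec)
    target = max(len(recs) for recs in by_class.values())
    balanced = []
    for recs in by_class.values():
        balanced.extend(recs)
        balanced.extend(rng.choices(recs, k=target - len(recs)))
    return balanced


# Toy data (hypothetical): 900 messages with label 0 and 100 with label 1.
toy = [{"id": i, "label": 0 if i < 900 else 1} for i in range(1000)]
reduced = downsample(toy)      # about 10% of each class
balanced = balance(reduced)    # both classes brought to the same size
```

Sampling within each class, rather than over the whole dataset at once, keeps the class proportions of the reduced set close to those of the original, which is one plausible way to preserve the representativeness of malicious patterns during down-sampling.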

Authors

  • Aparup Roy
    Bachelor of Science (B.S.) in Data Science and Applications (Pursuing), Indian Institute of Technology Madras, BS Degree Office, 3rd Floor, ICSR Building, IIT Madras, Chennai 600036, India.
  • Debotosh Bhattacharjee
    Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, West Bengal, India.
  • Ondrej Krejcar
    Center for Basic and Applied Research, Faculty of Informatics and Management, University of Hradec Kralove, Rokitanskeho 62, Hradec Kralove 500 03, Czech Republic. ondrej.krejcar@uhk.cz.
