An Assessment of Lexical, Network, and Content-Based Features for Detecting Malicious URLs Using Machine Learning and Deep Learning Models.

Journal: Computational Intelligence and Neuroscience

Published Date: Aug 25, 2022

Abstract

The services of the World Wide Web are essential to daily life and are reached through Uniform Resource Locators (URLs). Attackers exploit this channel by creating malicious URLs that lead to deceptive websites and domains, which they use to conduct fraud. Such threats open the door to critical attacks such as spam, spyware, phishing, and malware. Detecting malicious URLs is therefore crucial to preventing many cybercriminal activities. In this study, we examined a set of machine learning (ML) and deep learning (DL) models for detecting malicious websites using a dataset of 66,506 URL records. We engineered three types of features: lexical-based, network-based, and content-based. To extract the most discriminative features in the dataset, we applied several feature selection algorithms, namely correlation analysis, Analysis of Variance (ANOVA), and chi-square. Finally, we conducted a comparative performance evaluation of several ML and DL models using a set of criteria commonly used to evaluate such models. The results showed that Naïve Bayes (NB) was the best model for detecting malicious URLs on this dataset, with an accuracy of 96%. This research contributes to the field by conducting extensive feature engineering and analysis to identify the best features for malicious URL prediction, comparing different models, and achieving high accuracy on a large new URL dataset.
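To make the notion of lexical-based features concrete, the following is a minimal sketch of computing a few such features from the URL string alone. The abstract does not list the features the authors actually engineered, so the function name `lexical_features` and every feature in it are illustrative assumptions, not the paper's feature set.

```python
from urllib.parse import urlparse
import ipaddress


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features of a URL.

    Hypothetical feature set: chosen for illustration only, not taken
    from the paper. Only the URL string is inspected; no network
    request is made and no page content is fetched.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a commonly cited
    # lexical signal in the malicious URL literature.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "host_is_ip": host_is_ip,
        # Rough estimate: labels beyond "name.tld"; ignores multi-part TLDs.
        "num_subdomains": 0 if host_is_ip else max(host.count(".") - 1, 0),
    }


if __name__ == "__main__":
    for example in ("https://secure-login.example.com/account",
                    "http://192.168.0.1/login.php?id=42"):
        print(example, lexical_features(example))
```

Feature dictionaries of this kind could then be assembled into a table and passed to a feature selection step such as the chi-square or ANOVA tests named in the abstract. Network-based and content-based features would additionally require DNS or WHOIS lookups and page retrieval, which this sketch deliberately leaves out.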

Authors

Malak Aljabri

Computer Science Department, College of Computer and Information Systems, Umm Al-Qura University, Makkah 21955, Saudi Arabia.

Fahd Alhaidari

Department of Networks and Communications, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia.

Rami Mustafa A Mohammad

Department of Computer Information Systems, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia.

Samiha Mirza

SAUDI ARAMCO Cybersecurity Chair, Department of Computer Science, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia.

Dina H Alhamed

SAUDI ARAMCO Cybersecurity Chair, Department of Computer Engineering, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia.

Hanan S Altamimi

SAUDI ARAMCO Cybersecurity Chair, Department of Computer Science, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia.

Sara Mhd Bachar Chrouf

SAUDI ARAMCO Cybersecurity Chair, Department of Computer Science, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia.

Keywords

Algorithms; Bayes Theorem; Deep Learning; Machine Learning

External Resources

PubMed ID (PMID): 36059391
