Automated CVE Analysis: Harnessing Machine Learning In Designing Question-Answering Models For Cybersecurity Information Extraction
Journal: arXiv
Published Date: Dec 21, 2024
Abstract
The vast majority of cybersecurity information is unstructured text,
including critical data within databases such as CVE, NVD, CWE, CAPEC, and the
MITRE ATT&CK Framework. These databases are invaluable for analyzing attack
patterns and understanding attacker behaviors. Integrating this information
into a knowledge graph could expose relationships among vulnerabilities,
weaknesses, attack patterns, and attacker techniques. However,
processing this large amount of data requires advanced deep-learning
techniques. A crucial step towards building such a knowledge graph is
developing a robust mechanism for automating the extraction of answers to
specific questions from the unstructured text. Question Answering (QA) systems
play a pivotal role in this process by pinpointing and extracting precise
information, facilitating the mapping of relationships between various data
points. In the cybersecurity context, QA systems face distinct challenges
because they must interpret and answer questions that draw on a wide array of
domain-specific information. Tackling these challenges requires a
cybersecurity-specific dataset and a machine learning model trained on it to
improve the understanding and retrieval of domain-specific
information. This paper presents a novel dataset and describes a machine
learning model trained on this dataset for the QA task. It also discusses the
model's performance and key findings.
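
To make the extraction task concrete, the following is a minimal sketch of how one extractive QA record might be represented, assuming a SQuAD-style layout in which the answer is a character span of the context. The abstract does not specify the dataset schema, so the `QAExample` type, its field names, the helper functions, and the sample description are all hypothetical and are not taken from the paper or from any real CVE entry.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """One extractive QA record (hypothetical schema, not the paper's)."""
    context: str       # unstructured description text
    question: str      # question posed about the context
    answer_text: str   # answer, copied verbatim from the context
    answer_start: int  # character offset of the answer within the context


def make_example(context: str, question: str, answer_text: str) -> QAExample:
    """Build a record, locating the first occurrence of the answer in the context."""
    start = context.find(answer_text)
    if start < 0:
        raise ValueError("answer must appear verbatim in the context")
    return QAExample(context, question, answer_text, start)


def extract_span(example: QAExample) -> str:
    """Recover the answer from the stored offsets."""
    end = example.answer_start + len(example.answer_text)
    return example.context[example.answer_start:end]


# Made-up description for illustration only; not a real CVE entry.
context = (
    "A buffer overflow in the login handler of ExampleServer 1.2 "
    "allows remote attackers to execute arbitrary code via a long username."
)
example = make_example(
    context,
    "What can an attacker do?",
    "execute arbitrary code",
)
```

Storing the answer as an offset into the context, rather than as free text, is what makes the task extractive: a model only has to predict where the answer starts and ends, and each extracted span can then be attached to a knowledge-graph node or edge.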