Mono: Is Your "Clean" Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond
Journal: arXiv
Published Date: Jun 4, 2025
Abstract
The quantity and quality of vulnerability datasets are essential for
developing deep learning solutions to vulnerability-related tasks. Due to the
limited availability of vulnerabilities, a common approach to building such
datasets is analyzing security patches in source code. However, existing
security patches often suffer from inaccurate labels, insufficient contextual
information, and undecidable patches that fail to clearly represent the root
causes of vulnerabilities or their fixes. These issues introduce noise into the
dataset, which can mislead detection models and undermine their effectiveness.
To address these issues, we present mono, a novel LLM-powered framework that
simulates human experts' reasoning process to construct reliable vulnerability
datasets. mono introduces three key components to improve security patch
datasets: (i) semantic-aware patch classification for precise vulnerability
labeling, (ii) iterative contextual analysis for comprehensive code
understanding, and (iii) systematic root cause analysis to identify and filter
undecidable patches. Our comprehensive evaluation on the MegaVul benchmark
demonstrates that mono corrects 31.0% of labeling errors and recovers 89% of
inter-procedural vulnerabilities, and reveals that 16.7% of CVEs contain
undecidable patches. Furthermore, mono's enriched context representation
improves existing models' vulnerability detection accuracy by 15%. We
open-source the mono framework and the MonoLens dataset at
https://github.com/vul337/mono.
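
The three components named in the abstract can be pictured as a staged filter over candidate patches. The sketch below is illustrative only and is not taken from the mono codebase: the `Patch` type, the `build_dataset` function, the `max_rounds` limit, and the three callables (`classify`, `fetch_context`, `find_root_cause`) are hypothetical stand-ins for whatever LLM-backed steps the framework actually uses. It also assumes that patches classified as non-security are simply skipped.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Patch:
    """Hypothetical record for one candidate security patch."""
    cve_id: str
    diff: str
    label: Optional[str] = None
    context: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None


def build_dataset(
    patches: List[Patch],
    classify: Callable[[Patch], str],                    # stand-in for stage (i)
    fetch_context: Callable[[Patch], List[str]],         # stand-in for stage (ii)
    find_root_cause: Callable[[Patch], Optional[str]],   # stand-in for stage (iii)
    max_rounds: int = 3,                                 # assumed iteration cap
) -> Tuple[List[Patch], List[Patch]]:
    """Return (kept, undecidable) patches after the three stages."""
    kept: List[Patch] = []
    undecidable: List[Patch] = []
    for patch in patches:
        # Stage (i): label the patch; skip anything not judged security-related.
        patch.label = classify(patch)
        if patch.label != "security":
            continue
        # Stage (ii): request more context until none is returned or the cap is hit.
        for _ in range(max_rounds):
            extra = fetch_context(patch)
            if not extra:
                break
            patch.context.extend(extra)
        # Stage (iii): a patch with no identifiable root cause is set aside.
        patch.root_cause = find_root_cause(patch)
        if patch.root_cause is None:
            undecidable.append(patch)
        else:
            kept.append(patch)
    return kept, undecidable
```

Passing the stages in as callables keeps the control flow separate from any particular model or prompt, so each stage can be stubbed out when experimenting.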