AST-Enhanced or AST-Overloaded? The Surprising Impact of Hybrid Graph Representations on Code Clone Detection
Journal:
arXiv
Published Date:
Jun 17, 2025
Abstract
As one of the most detrimental code smells, code clones significantly
increase software maintenance costs and heighten vulnerability risks, making
their detection a critical challenge in software engineering. Abstract Syntax
Trees (ASTs) dominate deep learning-based code clone detection due to their
precise syntactic structure representation, but they inherently lack semantic
depth. Recent studies address this by enriching AST-based representations with
semantic graphs, such as Control Flow Graphs (CFGs) and Data Flow Graphs
(DFGs). However, the effectiveness of these enriched AST-based
representations and their compatibility with different graph-based machine
learning techniques remain open questions that warrant further investigation.
In this paper, we present a comprehensive empirical study to
rigorously evaluate the effectiveness of AST-based hybrid graph representations
in Graph Neural Network (GNN)-based code clone detection. We systematically
compare various hybrid representations (CFG, DFG, and Flow-Augmented ASTs
(FA-AST)) across multiple GNN architectures. Our experiments reveal that hybrid
representations impact GNNs differently: while AST+CFG+DFG consistently
enhances accuracy for convolution- and attention-based models (Graph
Convolutional Networks (GCN), Graph Attention Networks (GAT)), FA-AST
frequently introduces structural complexity that harms performance. Notably,
Graph Matching Networks (GMN) outperform the other models even with standard
AST representations, highlighting their superior cross-code similarity
detection and reducing the need for enriched structures.
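
To make the notion of an AST-based hybrid graph concrete, the following is a minimal sketch and not the pipeline used in the study. It assumes Python's standard `ast` module as the parser, and `build_hybrid_graph` is a hypothetical helper introduced only for illustration. Next-statement edges serve as a crude stand-in for CFG edges and def-use edges as a crude stand-in for DFG edges; branching, loops, and scoping are deliberately ignored.

```python
import ast


def build_hybrid_graph(source):
    """Return (nodes, edges); each edge is a (src, dst, kind) triple.

    kind is "ast" (parent -> child), "cfg" (statement -> next statement,
    a simplification of real control flow) or "dfg" (last definition of a
    variable -> later use, a simplification of real data flow).
    """
    nodes, edges = [], []
    stmt_index = {}  # id(statement node) -> graph node index
    last_def = {}    # variable name -> index of its most recent Store node

    def visit(node, parent):
        idx = len(nodes)
        nodes.append(type(node).__name__)
        if parent is not None:
            edges.append((parent, idx, "ast"))
        if isinstance(node, ast.stmt):
            stmt_index[id(node)] = idx
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load) and node.id in last_def:
                edges.append((last_def[node.id], idx, "dfg"))
            elif isinstance(node.ctx, ast.Store):
                last_def[node.id] = idx
        # Visit the right-hand side of an assignment before its targets,
        # so `x = x + 1` links the use of x to the previous definition.
        if isinstance(node, ast.Assign):
            children = [node.value, *node.targets]
        else:
            children = list(ast.iter_child_nodes(node))
        for child in children:
            visit(child, idx)
        # Chain consecutive statements in every statement list (body, orelse).
        for _, value in ast.iter_fields(node):
            if isinstance(value, list) and value and isinstance(value[0], ast.stmt):
                for a, b in zip(value, value[1:]):
                    edges.append((stmt_index[id(a)], stmt_index[id(b)], "cfg"))

    visit(ast.parse(source), None)
    return nodes, edges


nodes, edges = build_hybrid_graph("x = 1\ny = x + 2\n")
# Show only the edges that the hybrid representation adds on top of the AST.
print([(nodes[s], nodes[d], kind) for s, d, kind in edges if kind != "ast"])
```

Dropping the "cfg" and "dfg" edges recovers the plain AST graph, so the same node set can be fed to a GNN with or without the extra edge types when comparing representations.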