Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching
Journal:
arXiv
Published Date:
Jan 15, 2025
Abstract
Traditional similarity-based schema matching methods are incapable of
resolving semantic ambiguities and conflicts in domain-specific complex mapping
scenarios due to missing commonsense and domain-specific knowledge. The
hallucination problem of large language models (LLMs) also makes it challenging
for LLM-based schema matching to address the above issues. Therefore, we
propose a Knowledge Graph-based Retrieval-Augmented Generation model for Schema
Matching, referred to as the KG-RAG4SM. In particular, KG-RAG4SM introduces
novel vector-based, graph traversal-based, and query-based graph retrievals, as
well as a hybrid approach and ranking schemes that identify the most relevant
subgraphs from external large knowledge graphs (KGs). We showcase that KG-based
retrieval-augmented LLMs are capable of generating more accurate results for
complex matching cases without any re-training. Our experimental results show
that KG-RAG4SM outperforms the LLM-based state-of-the-art (SOTA) methods (e.g.,
Jellyfish-8B) by 35.89% and 30.50% in terms of precision and F1 score on the
MIMIC dataset, respectively; KG-RAG4SM with GPT-4o-mini outperforms the
pre-trained language model (PLM)-based SOTA methods (e.g., SMAT) by 69.20% and
21.97% in terms of precision and F1 score on the Synthea dataset, respectively.
The results also demonstrate that our approach is more efficient in end-to-end
schema matching, and scales to retrieve from large KGs. Our case studies on the
dataset from the real-world schema matching scenario exhibit that the
hallucination problem of LLMs for schema matching is well mitigated by our
solution.