CDE-Mapper: Using Retrieval-Augmented Language Models for Linking Clinical Data Elements to Controlled Vocabularies
Journal: arXiv
Published Date: May 7, 2025
Abstract
The standardization of clinical data elements (CDEs) aims to ensure
consistent and comprehensive patient information across various healthcare
systems. Existing methods often falter when standardizing CDEs of varying
representation and complex structure, impeding data integration and
interoperability in clinical research. We introduce CDE-Mapper, an innovative
framework that leverages a Retrieval-Augmented Generation approach combined with
Large Language Models to automate the linking of CDEs to controlled
vocabularies. Our modular approach features query decomposition to manage
varying levels of CDE complexity, integrates expert-defined rules within
prompt engineering, and employs in-context learning alongside multiple
retriever components to resolve terminological ambiguities. In addition, we
propose a knowledge reservoir validated by a human-in-the-loop approach, achieving
accurate concept linking for future applications while minimizing computational
costs. Across four diverse datasets, CDE-Mapper achieved an average accuracy
improvement of 7.2% over baseline methods. This work highlights
the potential of advanced language models in improving data harmonization and
significantly advancing capabilities in clinical decision support systems and
research.
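
As a rough illustration of the flow described in the abstract (query decomposition, several retrievers tried in turn, and a reservoir of human-validated links that is consulted first), the following minimal Python sketch may help. It is not the CDE-Mapper implementation: the vocabulary, concept identifiers, the `decompose` splitting rule, both retriever functions, and the `Mapper` class are hypothetical stand-ins, and no language model is called.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical toy vocabulary with made-up concept identifiers.
TOY_VOCABULARY: Dict[str, str] = {
    "systolic blood pressure": "C001",
    "diastolic blood pressure": "C002",
    "heart rate": "C003",
}


def decompose(cde: str) -> List[str]:
    """Naive stand-in for query decomposition: split a composite CDE on ' and '."""
    return [part.strip().lower() for part in cde.split(" and ") if part.strip()]


def exact_retriever(query: str) -> Optional[str]:
    """Return a concept only when the query matches a vocabulary term exactly."""
    return TOY_VOCABULARY.get(query)


def token_overlap_retriever(query: str) -> Optional[str]:
    """Return the term with the highest Jaccard token overlap, if at least 0.5."""
    query_tokens = set(query.split())
    best_id, best_score = None, 0.0
    for term, concept_id in TOY_VOCABULARY.items():
        term_tokens = set(term.split())
        score = len(query_tokens & term_tokens) / len(query_tokens | term_tokens)
        if score > best_score:
            best_id, best_score = concept_id, score
    return best_id if best_score >= 0.5 else None


@dataclass
class Mapper:
    retrievers: List[Callable[[str], Optional[str]]]
    # Reservoir of links a human reviewer has confirmed (assumed structure).
    reservoir: Dict[str, str] = field(default_factory=dict)

    def link(self, cde: str) -> Dict[str, Optional[str]]:
        """Map each sub-query to a concept ID, or None if nothing matched."""
        results: Dict[str, Optional[str]] = {}
        for query in decompose(cde):
            if query in self.reservoir:
                # Reuse a validated link and skip retrieval entirely.
                results[query] = self.reservoir[query]
                continue
            concept = None
            for retriever in self.retrievers:
                concept = retriever(query)
                if concept is not None:
                    break
            results[query] = concept
        return results

    def validate(self, query: str, concept_id: str) -> None:
        """Record a human-confirmed link for reuse on later calls."""
        self.reservoir[query.strip().lower()] = concept_id


mapper = Mapper(retrievers=[exact_retriever, token_overlap_retriever])
linked = mapper.link("Systolic blood pressure and heart rate")
```

In this sketch an unmatched abbreviation such as "systolic bp" returns `None` until a reviewer calls `validate`, after which the same query is answered from the reservoir without any retrieval step. That is one plausible reading of how reusing validated links could reduce computational cost.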