Applying AI to Support Categorization of Heterogeneous Epidemiological Datasets.

Journal: Studies in health technology and informatics

Published Date: May 15, 2025

Abstract

Findable, Accessible, Interoperable, and Reusable (FAIR) data is becoming increasingly important, particularly for enhancing data reuse in research. The National Research Data Infrastructure for Personal Health Data (NFDI4Health) aims to enhance the findability, reusability, and interoperability of health data derived from epidemiological, clinical, and public health studies. NFDI4Health has established the German Central Health Study Hub to improve health data findability through rich metadata. The Maelstrom Catalog, provided by Maelstrom Research, offers a comprehensive dataset of labeled and harmonized study variables, thereby enhancing the findability and reusability of epidemiological data. Both platforms rely on standardized categorization to optimize data reuse. To facilitate this process, NFDI4Health developed the Metadata Annotation Workbench, which supports metadata annotation with standardized vocabulary. This paper presents an AI solution for automatic classification and annotation, integrated into this service and based on a BioBERT text classifier. The model achieved a weighted F1-score of over 92% and demonstrated improved annotation performance, particularly for non-experts. It accelerates the categorization of study variables, thereby enhancing data findability and reuse. We are confident that the further development of such AI approaches will reduce curatorial workload and promote semantically annotated, interoperable data catalogs.
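The weighted F1-score reported above averages the per-class F1 scores, weighting each class by its number of true instances (its support). The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `weighted_f1` and the category labels in the usage note are hypothetical and chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Return the support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    support = Counter(y_true)    # true instances per class
    predicted = Counter(y_pred)  # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true     # weight each class by its support
    return total / len(y_true)
```

For example, with the hypothetical labels `["diet", "diet", "diet", "sleep", "sleep", "smoking"]` as ground truth and `["diet", "diet", "sleep", "sleep", "sleep", "smoking"]` as predictions, the per-class F1 scores are 0.8, 0.8, and 1.0 with supports 3, 2, and 1, giving a weighted F1 of 5/6. Because of the support weighting, frequent categories dominate the score, which matters when variable categories are unevenly represented.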

Authors

Julia Sasse

ZB MED - Information Centre for Life Sciences, Cologne, Germany, https://ror.org/0259fwx54.
Guillaume Fabre

Maelstrom Research, Research Institute of the McGill University Health Centre, Montreal, Canada.
Isabel Fortier

Maelstrom Research, Research Institute of the McGill University Health Centre, Montreal, Canada.
Pierre Zimmermann

University of Bonn, Germany, https://ror.org/041nas322.
Juliane Fluck

Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany; sumit.madan@scai.fraunhofer.de, juliane.fluck@scai.fraunhofer.de.

Keywords

Artificial Intelligence; Datasets as Topic; Electronic Health Records; Germany; Health Information Interoperability; Humans; Metadata; Natural Language Processing

External Resources

PubMed ID: 40380587
