Applying AI to Support Categorization of Heterogeneous Epidemiological Datasets.
Journal:
Studies in health technology and informatics
Published Date:
May 15, 2025
Abstract
The significance of Findable, Accessible, Interoperable, and Reusable (FAIR) data is increasing, particularly in the context of enhancing data reuse in research. The National Research Data Infrastructure for Personal Health Data (NFDI4Health) aims to enhance the findability, reusability, and interoperability of health data derived from epidemiological, clinical, and public health studies. NFDI4Health has established the German Central Health Study Hub to improve health data findability through rich metadata. The Maelstrom Catalog, provided by Maelstrom Research, offers a comprehensive dataset of labeled and harmonized study variables, thereby enhancing the findability and reusability of epidemiological data. Both platforms rely on standardized categorization to optimize data reuse. To facilitate this process, NFDI4Health developed the Metadata Annotation Workbench, which supports metadata annotation with standardized vocabulary. This paper presents an AI solution for automatic classification and annotation integrated into this service, using a BioBERT-based text classifier. The model achieved a weighted F1-score of over 92% and improved annotation performance, particularly for non-experts. By accelerating the categorization of study variables, it enhances data findability and reuse. We are confident that the further development of such AI approaches will reduce curatorial workload and promote semantically annotated, interoperable data catalogs.
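For readers unfamiliar with the reported metric, the following is a minimal sketch of how a weighted F1-score is conventionally computed: the per-class F1 scores are averaged, each weighted by the number of true instances of that class. The function name `weighted_f1` and the category labels in the example are hypothetical illustrations; they are not taken from the paper, and this is not the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighted by each class's share of true labels."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * count / total
    return score


# Hypothetical variable categories, for illustration only.
true_labels = ["diet", "diet", "sleep", "sleep"]
predicted_labels = ["diet", "sleep", "sleep", "sleep"]
print(round(weighted_f1(true_labels, predicted_labels), 4))
```

Because the weights follow class frequency, frequent categories dominate the score, so a high weighted F1 does not by itself show that rare categories are classified well.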