Open-Source Synthetic Data Generation of Clinical Routine Data.
Journal:
Studies in health technology and informatics
Published Date:
May 15, 2025
Abstract
Clinical routine data is a valuable resource for research and analysis in hospitals. Although regulations allow access within a clinical system, decentralized research with up-to-date data remains difficult because of privacy concerns. Synthetic data generation can expand research possibilities by providing statistically similar cohort data while still addressing privacy concerns. We used an open-source software package to generate synthetic data for our mixed tabular, event-based electronic patient records. Handling this kind of data poses a considerable challenge for the chosen package with regard to data quality, complexity, and overall structure. After preprocessing, data cleansing, and division of the data, we combined static and time-series-based generative algorithms to create a synthetic dataset. The evaluation is based on the similarity of marginal distributions. Although the approach showed potential in some cases, it became evident that more sophisticated work is needed to create datasets that mimic the whole range of the available clinical routine data.
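As an illustration of what a comparison of marginal distributions can look like, the following minimal Python sketch computes two common per-column similarity measures: the total variation distance for a categorical column and the two-sample Kolmogorov-Smirnov statistic for a numerical column. The abstract does not state which measures or which package functions were used, so the choice of metrics, the function names, and the toy values below are all assumptions made for illustration only.

```python
from bisect import bisect_right
from collections import Counter


def total_variation_distance(real, synthetic):
    """Total variation distance between the empirical category frequencies
    of two samples (0 = identical marginals, 1 = no shared categories)."""
    p, q = Counter(real), Counter(synthetic)
    categories = set(p) | set(q)
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synthetic)) for c in categories
    )


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the two empirical cumulative distribution functions."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    max_gap = 0.0
    for x in sorted(set(real_sorted) | set(synth_sorted)):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Hypothetical toy columns, not taken from the study.
real_sex = ["f", "f", "m", "m"]
synth_sex = ["f", "m", "m", "m"]
real_age = [30, 40, 50, 60]
synth_age = [35, 45, 55, 65]

tvd = total_variation_distance(real_sex, synth_sex)
ks = ks_statistic(real_age, synth_age)
print(f"TVD (categorical column): {tvd:.2f}")
print(f"KS statistic (numerical column): {ks:.2f}")
```

Both measures range from 0 to 1, with lower values indicating closer marginals. Because they look at one column at a time, they say nothing about correlations between columns or about temporal structure in event-based records.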