Open-Source Synthetic Data Generation of Clinical Routine Data.

Journal: Studies in health technology and informatics
Published Date:

Abstract

Clinical routine data is a valuable resource for research and analysis in hospitals. Although regulations allow access within a clinical system, decentralized research with up-to-date data remains difficult because of privacy concerns. Synthetic data generation can expand research possibilities by providing statistically similar cohort data without compromising privacy. We used an open-source software package to generate synthetic data for our mixed tabular, event-based electronic patient records. Handling this kind of data poses a demanding challenge for the chosen package in terms of data quality, complexity, and overall structure. After preprocessing, data cleansing, and division of the data, we combined static and time-series-based generative algorithms to create a synthetic dataset. The evaluation is based on the similarity of marginal distributions between the original and the synthetic data. Although the approach showed potential in some cases, it became evident that more sophisticated work is needed to create datasets that mimic the whole range of the available clinical routine data.
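
The abstract does not say which similarity measure was used for the marginal distributions. As a minimal sketch of what such a comparison can look like, the following Python example assumes a two-sample Kolmogorov-Smirnov statistic for numeric columns and the total variation distance for categorical columns. Both measures, the function names, and the toy values are illustrative assumptions, not the authors' method or data.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of the
    two samples (0 = identical marginals, 1 = fully separated).
    """
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    points = sorted(set(real_sorted) | set(synth_sorted))
    return max(
        abs(
            bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted)
        )
        for x in points
    )


def total_variation(real, synth):
    """Total variation distance between two categorical marginals.

    Half the sum of absolute differences in category frequencies
    (0 = identical marginals, 1 = disjoint categories).
    """
    p, q = Counter(real), Counter(synth)
    categories = set(p) | set(q)
    return 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synth)) for c in categories
    )


# Hypothetical toy columns, invented for illustration only.
real_age = [34, 51, 67, 45, 72, 58]
synth_age = [36, 49, 70, 44, 69, 60]
real_ward = ["ICU", "ICU", "Cardio", "Neuro"]
synth_ward = ["ICU", "Cardio", "Cardio", "Neuro"]

print("KS statistic (age):", ks_statistic(real_age, synth_age))
print("Total variation (ward):", total_variation(real_ward, synth_ward))
```

Marginal measures like these compare one column at a time, so they would not reveal whether relationships between columns or the ordering of events are preserved.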

Authors

  • Michael Grössler
    Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf.
  • Frank Ückert
    Institute for Applied Medical Informatics, Hamburg University Hospital, Hamburg, Germany.
  • Layla Tabea Riemann
    Physikalisch-Technische Bundesanstalt, Berlin, Germany.