How Useful Is Synthetic Data in Developing Predictive Models for Health?

Journal: Studies in health technology and informatics
Published Date:

Abstract

Synthetic data generated with generative AI techniques can closely mimic the characteristics of real data while enhancing privacy for sensitive health data. This study evaluates synthetic tabular data on two criteria, fidelity and utility, in the context of predictive models. Fidelity is measured by comparing univariate distributions and bivariate pairwise correlation differences, while utility is measured by comparing the performance of machine learning models trained on synthetic data with that of models trained on real data. Results show that models trained on synthetic data perform very similarly to models trained on real data. We also explore the potential of using synthetic data for hyperparameter tuning. Our findings reveal a strong correlation between prediction accuracy on synthetic and real data, suggesting that hyperparameters optimized using synthetic data can be effectively applied to models trained on real datasets.
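
The two fidelity measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which univariate statistic was used, so the two-sample Kolmogorov-Smirnov statistic below is an assumption, as are the function names (`ks_statistic`, `pairwise_correlation_difference`) and the toy data. The same `pearson` helper could also be used to correlate model accuracy on synthetic and real data across hyperparameter settings.

```python
import bisect
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # Constant columns have no defined correlation; report 0.0 here.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def ks_statistic(x, y):
    """Univariate fidelity (assumed measure): the largest gap between
    the empirical CDFs of a real and a synthetic column."""
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs + ys))
    return max(
        abs(bisect.bisect_right(xs, p) / len(xs)
            - bisect.bisect_right(ys, p) / len(ys))
        for p in points
    )


def pairwise_correlation_difference(real_cols, synth_cols):
    """Bivariate fidelity: mean absolute difference between the
    off-diagonal Pearson correlations of the two tables.
    Each argument is a list of columns in the same order."""
    k = len(real_cols)
    diffs = [
        abs(pearson(real_cols[i], real_cols[j])
            - pearson(synth_cols[i], synth_cols[j]))
        for i in range(k) for j in range(i + 1, k)
    ]
    return sum(diffs) / len(diffs)


def toy_table(rng, n=500):
    """Hypothetical two-column table with correlated columns."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [v + rng.gauss(0, 0.5) for v in a]
    return [a, b]


# Toy stand-ins for a real table and a synthetic table (illustrative only).
rng = random.Random(0)
real_table = toy_table(rng)
synthetic_table = toy_table(rng)

ks_per_column = [ks_statistic(r, s) for r, s in zip(real_table, synthetic_table)]
pcd = pairwise_correlation_difference(real_table, synthetic_table)
print("KS per column:", ks_per_column)
print("Pairwise correlation difference:", pcd)
```

Under both measures, values near 0 indicate that the synthetic table tracks the real one closely; what counts as close enough depends on the dataset and is not specified in the abstract.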

Authors

  • Mohammad Ahmed Basri
    System Design Engineering.
  • Helen Chen
    University of Waterloo.