Evaluation of Synthetic Data Generation Methods for Medical Tabular Data: Representation of Distribution Tails.
Journal: Studies in Health Technology and Informatics
Published Date: Aug 7, 2025
Abstract
Synthetic data generation, by artificial intelligence (AI) and other means, has the potential to enable data sharing and analysis while preserving privacy and maintaining the statistical characteristics of the original data, and various methods have been developed. In medical datasets, abnormal values are more critical than normal values for identifying diseases, making accurate representation of distribution tails essential. However, existing evaluations of synthetic data generation methods have often not focused on distribution tails. This study generated synthetic data from actual specimen test results at a university hospital and analyzed how well the distribution tails were represented. We found that the Forest Diffusion model represents the tail characteristics of the original data better than the Gaussian Copula model or the Conditional Tabular Generative Adversarial Network (CTGAN) model. Because the representation of distribution tails varies significantly across generation methods, careful consideration of tail characteristics is crucial when generating synthetic medical data.
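The abstract does not state which tail metrics the study used, so the following is only an illustrative sketch of how tail representation might be checked for a single numeric column. The function names `tail_quantile_errors` and `tail_mass_ratio`, the chosen quantile levels, and the interquartile-range scaling are all assumptions made for this example, not the authors' method.

```python
import numpy as np


def tail_quantile_errors(real, synth, quantiles=(0.01, 0.05, 0.95, 0.99)):
    """Gap between synthetic and real values at each tail quantile.

    Hypothetical metric: the absolute gap is divided by the interquartile
    range of the real data so that columns with different units are comparable.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    iqr = np.quantile(real, 0.75) - np.quantile(real, 0.25)
    errors = {}
    for q in quantiles:
        gap = abs(np.quantile(synth, q) - np.quantile(real, q))
        errors[q] = float(gap / iqr)
    return errors


def tail_mass_ratio(real, synth, q=0.99):
    """Share of synthetic values above the real q-quantile, relative to (1 - q).

    A value near 1 suggests the upper tail carries similar mass in both
    datasets; a value well below 1 suggests the synthetic data under-represents
    extreme high values.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    threshold = np.quantile(real, q)
    return float(np.mean(synth > threshold)) / (1.0 - q)


if __name__ == "__main__":
    # Toy data only: a right-skewed "real" column and a Gaussian stand-in
    # for a synthetic column with matched mean and standard deviation.
    rng = np.random.default_rng(0)
    real = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)
    synth = rng.normal(loc=real.mean(), scale=real.std(), size=20_000)
    print(tail_quantile_errors(real, synth))
    print(tail_mass_ratio(real, synth))
```

In this toy setup the Gaussian stand-in matches the mean and variance of the skewed column yet places far less mass in the upper tail, which is the kind of discrepancy that whole-distribution summaries can hide and a tail-focused check is meant to expose.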