Evaluation of Synthetic Data Generation Methods for Medical Tabular Data: Representation of Distribution Tails.

Journal: Studies in Health Technology and Informatics
Published Date:

Abstract

Synthetic data generation, by artificial intelligence (AI) and other means, can make it possible to share and analyze data while preserving privacy and retaining the statistical characteristics of the original data, and various generation methods have been developed. In medical datasets, abnormal values are more critical than normal values for identifying diseases, so accurate representation of distribution tails is essential. However, existing evaluations of synthetic data generation methods have rarely focused on distribution tails. This study generated synthetic data from actual specimen test results at a university hospital and analyzed how well the distribution tails were represented. We found that the Forest Diffusion model represented the tails of the original data's distributions better than the Gaussian Copula model or the Conditional Tabular Generative Adversarial Network (CTGAN) model. Because tail behavior varies substantially across generation methods, tail characteristics should be considered carefully when generating synthetic medical data.
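
The abstract does not state which tail metrics were used. As an illustration only, the following minimal sketch shows one way tail representation could be checked for a single numeric column: compare an extreme quantile of the synthetic data with that of the original, and compare the share of values beyond the original's threshold. The function names (quantile, tail_report), the 99th-percentile cutoff, and the simulated log-normal "lab value" are all assumptions made for this example, not the study's data or method, and the naive Gaussian fit merely stands in for a generator that matches mean and variance but not tail shape.

```python
import random
import statistics


def quantile(values, q):
    """Empirical quantile with linear interpolation, for 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def tail_report(real, synth, q=0.99):
    """Compare the upper tail of a synthetic column against the real one."""
    threshold = quantile(real, q)
    return {
        "real_quantile": threshold,
        "synth_quantile": quantile(synth, q),
        # Fraction of values above the real data's q-quantile.
        "real_tail_mass": sum(v > threshold for v in real) / len(real),
        "synth_tail_mass": sum(v > threshold for v in synth) / len(synth),
    }


rng = random.Random(0)

# Hypothetical right-skewed "lab value" (log-normal); not data from the study.
real = [rng.lognormvariate(3.0, 0.5) for _ in range(20000)]

# Stand-in generator: a Gaussian matching only the mean and standard deviation.
mu, sigma = statistics.fmean(real), statistics.stdev(real)
synth = [rng.gauss(mu, sigma) for _ in range(20000)]

report = tail_report(real, synth, q=0.99)
for key, value in report.items():
    print(f"{key}: {value:.4f}")
```

In this toy setup the synthetic column reproduces the mean and spread of the original yet places far less mass above the original's 99th percentile, which is the kind of discrepancy a tail-focused evaluation is meant to expose.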

Authors

  • Ohmi Mohri
    Department of Healthcare Information Management, The University of Tokyo Hospital.
  • Tomohisa Seki
    Department of Cardiology, Keio University School of Medicine, 35 Shinanomachi, Shinjuku-ku, Tokyo 160-8582, Japan; Department of Emergency and Critical Care Medicine, Keio University School of Medicine, Tokyo 160-8582, Japan.
  • Yoshimasa Kawazoe
    Department of Biomedical Informatics, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo, Japan.
  • Kazuhiko Ohe
    Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.