Evaluation of Synthetic Data Generation Methods for Medical Tabular Data: Representation of Distribution Tails.

Journal: Studies in health technology and informatics

Published Date: Aug 7, 2025

Abstract

Synthetic data generation, by artificial intelligence (AI) and other means, has the potential to enable data sharing and analysis while preserving privacy and maintaining statistical characteristics, and various methods have been developed. In medical datasets, abnormal values are more critical than normal values for identifying diseases, so accurate representation of distribution tails is essential. However, existing evaluations of synthetic data generation methods have often not focused on distribution tails. This study generated synthetic data from actual specimen test results at a university hospital and analyzed how well the distribution tails were represented. We found that the Forest Diffusion model represents the tails of the original data's distributions better than the Gaussian Copula model or the conditional tabular generative adversarial network (CTGAN) model. Because the representation of distribution tails varies significantly across generation methods, careful consideration of tail characteristics is crucial when generating synthetic medical data.
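
The abstract does not state which tail metrics the study used. As an illustration only, the sketch below shows one simple way a single numeric column of original and synthetic data could be compared at the tails: the gap between matching extreme quantiles, and the share of synthetic values that fall beyond the original's 99th percentile relative to the expected 1%. The function names, the quantile levels, and the toy lognormal and Gaussian samples are all assumptions made for this example and are not taken from the paper.

```python
import random


def empirical_quantile(values, q):
    """Empirical quantile with linear interpolation, for 0 <= q <= 1."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def tail_quantile_gaps(original, synthetic, qs=(0.01, 0.05, 0.95, 0.99)):
    """Absolute gap between original and synthetic quantiles at each tail level."""
    return {
        q: abs(empirical_quantile(original, q) - empirical_quantile(synthetic, q))
        for q in qs
    }


def tail_mass_ratio(original, synthetic, q=0.99):
    """Share of synthetic values above the original's q-quantile, divided by (1 - q).

    A value near 1 means the synthetic upper tail carries about the expected mass;
    a value well below 1 means the upper tail is under-represented.
    """
    threshold = empirical_quantile(original, q)
    beyond = sum(1 for v in synthetic if v > threshold)
    return (beyond / len(synthetic)) / (1 - q)


# Toy data (hypothetical, not from the study): a right-skewed "lab value"
# and a symmetric Gaussian stand-in for a generator that misses the skew.
rng = random.Random(0)
original = [rng.lognormvariate(0.0, 0.5) for _ in range(5000)]
mean = sum(original) / len(original)
std = (sum((v - mean) ** 2 for v in original) / len(original)) ** 0.5
synthetic = [rng.gauss(mean, std) for _ in range(5000)]

print(tail_quantile_gaps(original, synthetic))
print(tail_mass_ratio(original, synthetic))
```

Checks of this kind complement whole-distribution similarity scores, which can look acceptable even when the extreme values that matter clinically are poorly reproduced.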

Authors

Ohmi Mohri
Department of Healthcare Information Management, The University of Tokyo Hospital.

Tomohisa Seki
Department of Cardiology, Keio University School of Medicine, 35 Shinanomachi, Shinjuku-ku, Tokyo 160-8582, Japan; Department of Emergency and Critical Care Medicine, Keio University School of Medicine, Tokyo 160-8582, Japan.

Yoshimasa Kawazoe
Department of Biomedical Informatics, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo, Japan.

Kazuhiko Ohe
Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.

Keywords

Artificial Intelligence; Electronic Health Records; Humans

External Resources

PubMed ID (PMID): 40775942
