High-Fidelity Synthetic Data Replicates Clinical Prediction Performance in a Million-Patient Diabetes Cohort.
Journal:
Advanced science (Weinheim, Baden-Wurttemberg, Germany)
Published Date:
Mar 16, 2026
Abstract
Synthetic patient data offer a promising avenue for clinical research, but their usefulness depends on preserving statistical fidelity, biomedical plausibility, and patient privacy. To address this, a dual adversarial autoencoder is employed to generate longitudinal synthetic datasets from real-world clinical data of nearly one million individuals with diabetes from the Andalusian Population Health Database. A multi-faceted evaluation assesses data utility in a machine learning task, predicting chronic kidney disease onset, and evaluates the biomedical plausibility of generated disease trajectories. Models trained exclusively on synthetic data demonstrate predictive performance comparable to those trained on real data and show stability in feature importance rankings, indicating clinical coherence. However, bias and domain-specific sex-stratified analyses reveal inconsistencies not discernible through standard metrics, while data augmentation provides no performance benefit, as data saturation is reached given the large source population. These findings demonstrate that while synthetic data can replicate predictive performance, a robust validation framework combining machine learning utility with domain-specific biomedical evaluation is essential. This work supports the use of synthetic data for large-scale, privacy-preserving research to enable a collaborative healthcare data ecosystem.
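One way to make the "stability in feature importance rankings" check concrete is a rank correlation between the importances of a model trained on real data and one trained on synthetic data. The sketch below is only an illustration under stated assumptions: the abstract does not say which stability measure the authors used, Spearman's rho is substituted here as a common choice, and the feature names and importance values are hypothetical toy numbers, not results from the study.

```python
def ranks(values):
    """Return 1-based ranks (largest value gets rank 1), averaging ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks of a and b.

    Assumes both inputs have the same length and are not constant.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical feature names and importances (toy values, not from the paper).
features = ["age", "hba1c", "egfr", "bmi", "sbp"]
importance_real = [0.30, 0.25, 0.20, 0.15, 0.10]   # model trained on real data
importance_synth = [0.28, 0.22, 0.26, 0.14, 0.10]  # model trained on synthetic data

rho = spearman(importance_real, importance_synth)
print(f"Spearman rho between importance rankings: {rho:.2f}")
```

A rho close to 1 means the two models order the features almost identically. As the abstract cautions, a summary number like this can hide subgroup inconsistencies, so the same comparison would need to be repeated within strata (for example, by sex) before drawing conclusions.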