High-Fidelity Synthetic Data Replicates Clinical Prediction Performance in a Million-Patient Diabetes Cohort.

Journal: Advanced science (Weinheim, Baden-Wurttemberg, Germany)

Published Date: Mar 16, 2026

Abstract

Synthetic patient data offer a promising avenue for clinical research, but their usefulness depends on preserving statistical fidelity, biomedical plausibility, and patient privacy. To address this, a dual adversarial autoencoder is employed to generate longitudinal synthetic datasets from real-world clinical data of nearly one million individuals with diabetes from the Andalusian Population Health Database. A multi-faceted evaluation assesses data utility in a machine learning task, predicting chronic kidney disease onset, and evaluates the biomedical plausibility of generated disease trajectories. Models trained exclusively on synthetic data demonstrate predictive performance comparable to those trained on real data and show stability in feature importance rankings, indicating clinical coherence. However, bias and domain-specific sex-stratified analyses reveal inconsistencies not discernible through standard metrics, while data augmentation provides no performance benefit, as data saturation is reached given the large source population. These findings demonstrate that while synthetic data can replicate predictive performance, a robust validation framework combining machine learning utility with domain-specific biomedical evaluation is essential. This work supports the use of synthetic data for large-scale, privacy-preserving research to enable a collaborative healthcare data ecosystem.
