Reward-Guided Generation Improves the Scientific Utility of Synthetic Biomedical Data

Journal: medRxiv

Published Date: Mar 16, 2026

Abstract

Synthetic data generation is a promising approach for biomedical data sharing and dataset augmentation, yet existing methods lack mechanisms to preserve the statistical properties necessary for scientific analysis. To address this, we introduce RLSYN+REG, a reinforcement learning-driven generative model that encourages regression models trained on synthetic data to reproduce the coefficients and predictions of their real-data counterparts. We evaluate RLSYN+REG on MIMIC-III and the American Community Survey (ACS) across regression model reproduction, fidelity to real data, and privacy. Synthetic data from RLSYN+REG substantially improves upon that of RLSYN, raising correlations between real and synthetic regression coefficients from 0.054 to 0.600 on MIMIC-III and from 0.160 to 0.376 on ACS. Predictive performance also improves, narrowing the gap to real-data baselines by 81.4% on MIMIC-III and 97.6% on ACS. These improvements come with negligible cost to fidelity or privacy and are robust to reductions in training data.
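The two headline metrics in the abstract can be illustrated with a minimal sketch. The abstract does not specify the regression family, the correlation measure, or the exact gap definition, so everything below is an assumption made for illustration: ordinary least squares, Pearson correlation over slope coefficients, and a gap measured as the absolute difference from the real-data score. The function names `fit_ols`, `coefficient_correlation`, and `gap_reduction` are hypothetical and do not come from the paper.

```python
import numpy as np


def fit_ols(X, y):
    """Fit ordinary least squares; returns coefficients with the intercept first."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def coefficient_correlation(X_real, y_real, X_syn, y_syn):
    """Pearson correlation between slope coefficients fit on real vs. synthetic data.

    Assumption: the intercept is excluded and Pearson correlation is used;
    the paper may define this differently.
    """
    b_real = fit_ols(X_real, y_real)[1:]
    b_syn = fit_ols(X_syn, y_syn)[1:]
    return float(np.corrcoef(b_real, b_syn)[0, 1])


def gap_reduction(real_score, baseline_score, improved_score):
    """Fraction of the baseline's gap to the real-data score that the improved model closes.

    Assumption: the gap is the absolute difference in a single predictive score.
    """
    baseline_gap = abs(real_score - baseline_score)
    improved_gap = abs(real_score - improved_score)
    return 1.0 - improved_gap / baseline_gap
```

Under these assumptions, a model fit on synthetic data that recovers the same slopes as the real-data model yields a coefficient correlation near 1, and `gap_reduction` returns a fraction that can be read as a percentage, in the same form as the reductions the abstract reports.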

Authors

Jackson, N. J.; Espinosa-Dice, N.; Yan, C.; Malin, B. A.
