Quality metrics of synthetic radiomics data do not predict improvement under simulated external validation: an ecological fallacy across 50 public datasets.

Journal: European Radiology

Abstract

OBJECTIVES: To investigate whether established synthetic data quality metrics predict when deep generative augmentation improves performance under simulated external validation in radiomics.

MATERIALS AND METHODS: Three conditional generators (WGAN-GP, CVAE, TabDDPM) were trained on 50 public binary-classification radiomic datasets from radMLBench. For each dataset-generator pair (n = 150), five quality metrics and ΔAUC (change in simulated external AUC under domain shift) were recorded across 10 repetitions; undefined AUCs (single-class test folds) were excluded a priori (30/2500 external, 1.2%). Six analyses tested quality-ΔAUC associations: rank correlations with FDR correction, ROC discrimination, subgroup random-effects meta-analysis, composite scoring, quality-guided generator selection, and SMOTE comparison. All experiments were replicated with a random forest classifier.

RESULTS: Pooled quality metrics correlated significantly with ΔAUC (|ρ| = 0.30-0.36, all adjusted p < 0.001); within each generator, all correlations were non-significant (all p > 0.05), revealing an ecological fallacy. After Benjamini-Hochberg correction, 5 of 20 correlations survived: the 4 pooled associations and DDPM MMD (unexpected direction). ROC-AUC ranged from 0.42 to 0.61 (near-chance). Quality-guided selection yielded a significantly negative pooled Δ (-0.0051; 95% CI: -0.0083 to -0.0019) and improved external AUC in only 17/50 (34%) datasets versus 34/50 (68%) for an oracle. Random forest replication rendered pooled correlations non-significant (all p > 0.09). A pre-specified sensitivity analysis on the 29/50 datasets statistically distinguishable from chance produced identical conclusions.

CONCLUSION: Aggregate quality-performance correlations are driven by between-generator differences, not within-generator variation that could guide practical decisions. Quality metrics are insufficient proxies for clinical utility; task-specific external validation remains indispensable.

KEY POINTS:
Question: Can established quality metrics of synthetic radiomics data identify when generative augmentation actually improves the performance of prediction models under simulated external validation?
Findings: Across 50 public datasets, aggregate quality-performance correlations vanished within individual generators (ecological fallacy), and quality-guided generator selection underperformed real-only training.
Clinical relevance: Quality metrics commonly used to validate synthetic radiomics data cannot identify when augmentation improves clinical model performance under simulated domain shift, underscoring the irreplaceable role of task-specific external validation before clinical deployment.
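
The ecological fallacy described in the abstract can be illustrated with a minimal sketch. It is not the study's code or data: the generator labels (`gen_A`, `gen_B`, `gen_C`), the quality scores, and the ΔAUC values are hypothetical toy numbers chosen so that generators differ in their average quality and average ΔAUC, while quality carries no rank information about ΔAUC inside any single generator. The helper `spearman` is a plain-Python implementation written for this example and assumes no tied values.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; assumes no tied values (true for the toy data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical within-generator pattern (NOT the study's data):
# quality ranks 1,2,3,4 pair with delta-AUC ranks 2,4,1,3, so rho is exactly 0.
WITHIN_QUALITY = [0.00, 0.01, 0.02, 0.03]
WITHIN_DELTA = [0.001, 0.003, 0.000, 0.002]

# Hypothetical per-generator offsets: (mean quality shift, mean delta-AUC shift).
# Only these between-generator shifts link quality to delta-AUC.
GENERATOR_OFFSETS = {
    "gen_A": (0.0, -0.02),
    "gen_B": (0.1, -0.01),
    "gen_C": (0.2, 0.00),
}

pooled_quality, pooled_delta, within_rho = [], [], {}
for name, (q_shift, d_shift) in GENERATOR_OFFSETS.items():
    quality = [q_shift + q for q in WITHIN_QUALITY]
    delta = [d_shift + d for d in WITHIN_DELTA]
    within_rho[name] = spearman(quality, delta)
    pooled_quality.extend(quality)
    pooled_delta.extend(delta)

pooled_rho = spearman(pooled_quality, pooled_delta)

print(f"pooled rho = {pooled_rho:.3f}")  # pooled rho = 0.895
for name, rho in within_rho.items():
    print(f"{name} rho = {rho:.3f}")     # 0.000 for every generator
```

In this toy setup the pooled correlation is strong even though quality has no rank association with ΔAUC within any generator, so choosing a generator for a given dataset by its quality score would not be informed by the pooled figure. A real analysis would also need tie handling, p-values, and a multiple-comparison correction such as Benjamini-Hochberg, which this sketch omits.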
