Toward a Synthetic Data Revolution: Diffusion Model-Enhanced Hepatocellular Carcinoma Prediction in Steatotic Liver Disease.

Journal: Hepatology research : the official journal of the Japan Society of Hepatology

Abstract

AIM: Steatotic liver disease (SLD) encompasses a heterogeneous spectrum with varying risks of hepatocellular carcinoma (HCC). Small sample sizes constrain the development of predictive models, particularly for rare outcomes. This study evaluated whether generative artificial intelligence (AI)-based synthetic data augmentation can enhance HCC risk prediction in patients with SLD.

METHODS: A random survival forest (RSF) model was developed using data from 406 patients with biopsy-confirmed SLD. The dataset was divided into training (n = 284) and testing (n = 122) cohorts. Two synthetic data generation approaches, the conditional tabular generative adversarial network (CTGAN; a type of generative adversarial network [GAN]) and diffusion models, were used to augment the training dataset from 284 to 1000 samples. Model performance was assessed using Harrell's C-index and the integrated Brier score (IBS). Synthetic data quality was evaluated using maximum mean discrepancy (MMD) and Wasserstein distance.

RESULTS: During a mean follow-up of 5.9 years, 12 patients (3.0%) developed HCC. In the test cohort, the baseline RSF model achieved a C-index of 0.912. Following augmentation, the C-index of the diffusion-augmented model improved to 0.949, whereas that of the GAN-augmented model decreased to 0.818. Diffusion-generated data showed superior fidelity to GAN-generated data, with a lower MMD (0.0303 vs. 0.0762) and a lower Wasserstein distance (0.0467 vs. 0.1145). Both augmentation approaches improved calibration relative to baseline (IBS: diffusion, 0.0103; GAN, 0.0108; baseline, 0.0114).

CONCLUSIONS: Diffusion-based synthetic data augmentation improved HCC risk prediction in the test cohort, whereas GAN augmentation reduced model accuracy. These findings suggest that diffusion models may help address data scarcity in hepatology research and may offer a useful approach for developing predictive models in limited cohorts.
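
For readers unfamiliar with Harrell's C-index, the discrimination metric reported above, the following is a minimal illustrative sketch in plain Python. It is not the authors' implementation (the abstract does not state which software was used), it ignores tied event times for simplicity, and the function name `harrell_c_index` and the toy data are hypothetical. A pair of patients is treated as comparable when the one with the shorter follow-up time had an observed event; the index is the fraction of comparable pairs in which that patient also received the higher predicted risk.

```python
def harrell_c_index(times, events, risk_scores):
    """Simplified Harrell's C-index (tied event times are not handled).

    times       -- follow-up time for each subject
    events      -- 1 if the event (e.g. HCC) was observed, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair needs an observed event at the earlier time
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event had higher predicted risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data (not from the study): five subjects, two events.
toy_times = [2, 4, 6, 8, 10]
toy_events = [1, 0, 1, 0, 0]
toy_risk = [0.9, 0.3, 0.2, 0.4, 0.1]

print(round(harrell_c_index(toy_times, toy_events, toy_risk), 4))  # 0.8333
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported values of 0.912, 0.949, and 0.818 should be read.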
