The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation
Journal:
arXiv
Published Date:
Feb 11, 2025
Abstract
Generative models, particularly text-to-image (T2I) diffusion models, play a
crucial role in medical image analysis. However, these models are prone to
training data memorization, posing significant risks to patient privacy.
Synthetic chest X-ray generation is one of the most common applications in
medical image analysis with the MIMIC-CXR dataset serving as the primary data
repository for this task. This study presents the first systematic attempt to
identify prompts and text tokens in MIMIC-CXR that contribute the most to
training data memorization. Our analysis reveals two unexpected findings: (1)
prompts containing traces of de-identification procedures (markers introduced
to hide Protected Health Information) are the most memorized, and (2) among all
tokens, de-identification markers contribute the most towards memorization.
This highlights a broader issue with the standard anonymization practices and
T2I synthesis with MIMIC-CXR. To exacerbate, existing inference-time
memorization mitigation strategies are ineffective and fail to sufficiently
reduce the model's reliance on memorized text tokens. On this front, we propose
actionable strategies for different stakeholders to enhance privacy and improve
the reliability of generative models in medical imaging. Finally, our results
provide a foundation for future work on developing and benchmarking
memorization mitigation techniques for synthetic chest X-ray generation using
the MIMIC-CXR dataset. The anonymized code is available at
https://anonymous.4open.science/r/diffusion_memorization-8011/