Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models
Journal: arXiv
Published Date: Jun 12, 2025
Abstract
Latent Diffusion Models have shown remarkable results in text-guided image
synthesis in recent years. In the domain of natural (RGB) images, recent works
have shown that such models can be adapted to various vision-language
downstream tasks with little to no supervision. In contrast,
text-to-image Latent Diffusion Models remain relatively underexplored in
medical imaging, primarily because of limited data availability (e.g., due
to privacy concerns). In this work, focusing on the chest X-ray modality, we
first demonstrate that a standard text-conditioned Latent Diffusion Model has
not learned to align clinically relevant information in free-text radiology
reports with the corresponding areas of the given scan. To address this
issue, we propose a fine-tuning framework that improves multi-modal alignment in a
pre-trained model so that it can be efficiently repurposed for downstream
tasks such as phrase grounding. Our method sets a new state-of-the-art on a
standard benchmark dataset (MS-CXR), while also exhibiting robust performance
on out-of-distribution data (VinDr-CXR). Our code will be made publicly
available.