Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation
Journal: arXiv
Published Date: May 2, 2025
Abstract
Generative models have revolutionized Artificial Intelligence (AI),
particularly in multimodal applications. However, adapting these models to the
medical domain poses unique challenges due to the complexity of medical data
and the stringent need for clinical accuracy. In this work, we introduce a
framework specifically designed for multimodal medical data generation. By
enabling the generation of multi-view chest X-rays and their associated
clinical report, it bridges the gap between general-purpose vision-language
models and the specialized requirements of healthcare. Leveraging the MIMIC-CXR
dataset, the proposed framework shows superior performance in generating
high-fidelity images and semantically coherent reports. Our quantitative
evaluation yields strong FID scores for the generated images and strong BLEU
scores for the generated reports, supporting the quality of the synthetic data.
Notably, on downstream disease classification tasks, data generated by our
framework achieves performance comparable or even superior to that of real
data, underlining its potential as a tool for medical
research and diagnostics. This study highlights the importance of
domain-specific adaptations in enhancing the relevance and utility of
generative models for clinical applications, paving the way for future
advancements in synthetic multimodal medical data generation.
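
For readers unfamiliar with the report metric named above, the following is a minimal sketch of sentence-level BLEU in plain Python. It assumes a single reference report, whitespace tokenization, uniform n-gram weights, and no smoothing. The abstract does not state the exact BLEU configuration used, so these choices, the function names `ngrams` and `sentence_bleu`, and the example sentences are illustrative assumptions rather than the authors' evaluation code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing (assumed setup)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the modified n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical report sentences, used only to exercise the function.
reference = "no acute cardiopulmonary process".split()
generated = "no acute process".split()
score = sentence_bleu(reference, generated, max_n=2)
```

A corpus-level evaluation would typically aggregate n-gram counts over all reports before computing precisions, and an established library implementation is preferable to this sketch for reported results.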