VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation
Journal:
arXiv
Published Date:
Jun 30, 2025
Abstract
As the appearance of medical images is influenced by multiple underlying
factors, generative models require rich attribute information beyond labels to
produce realistic and diverse images. For instance, generating an image of skin
lesion with specific patterns demands descriptions that go beyond diagnosis,
such as shape, size, texture, and color. However, such detailed descriptions
are not always accessible. To address this, we explore a framework, termed
Visual Attribute Prompts (VAP)-Diffusion, to leverage external knowledge from
pre-trained Multi-modal Large Language Models (MLLMs) to improve the quality
and diversity of medical image generation. First, to derive descriptions from
MLLMs without hallucination, we design a series of prompts following
Chain-of-Thoughts for common medical imaging tasks, including dermatologic,
colorectal, and chest X-ray images. Generated descriptions are utilized during
training and stored across different categories. During testing, descriptions
are randomly retrieved from the corresponding category for inference. Moreover,
to make the generator robust to unseen combination of descriptions at the test
time, we propose a Prototype Condition Mechanism that restricts test embeddings
to be similar to those from training. Experiments on three common types of
medical imaging across four datasets verify the effectiveness of VAP-Diffusion.