Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

Journal: arXiv

Published Date: May 31, 2025

Abstract

Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric Computed Tomography (CT) remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging.

Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages.

Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance.

Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation.
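
The abstract describes a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports, but it does not state the training objective. As a rough illustration only, the sketch below assumes the standard symmetric contrastive (InfoNCE) loss used by CLIP-style models. The function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all hypothetical and are not taken from the paper. Pure Python is used so that the example runs without dependencies.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def diagonal_log_softmax(logits):
    """For each row i, return log-softmax of the matching entry logits[i][i]."""
    values = []
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        values.append(row[i] - log_sum_exp)
    return values


def clip_style_loss(ct_embeddings, report_embeddings, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of ct_embeddings is assumed to be paired with row i of
    report_embeddings. The temperature is an assumed hyperparameter.
    """
    ct = [l2_normalize(v) for v in ct_embeddings]
    text = [l2_normalize(v) for v in report_embeddings]

    # Cosine similarity between every CT volume and every report, scaled.
    logits = [
        [sum(a * b for a, b in zip(c, t)) / temperature for t in text]
        for c in ct
    ]
    transposed = [list(column) for column in zip(*logits)]

    batch_size = len(ct)
    ct_to_text = -sum(diagonal_log_softmax(logits)) / batch_size
    text_to_ct = -sum(diagonal_log_softmax(transposed)) / batch_size
    return 0.5 * (ct_to_text + text_to_ct)


# Toy embeddings standing in for encoder outputs (hypothetical values).
ct_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_reports = [[1.0, 0.0], [0.0, 1.0]]
swapped_reports = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = clip_style_loss(ct_batch, aligned_reports)
swapped_loss = clip_style_loss(ct_batch, swapped_reports)
print(aligned_loss < swapped_loss)
```

In this toy setup the correctly paired batch yields a much lower loss than the batch with swapped reports. A shared embedding space for conditioning depends on exactly this property: matching volume-report pairs are pulled together and mismatched pairs are pushed apart.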

Authors

Daniele Molino
Camillo Maria Caruso
Filippo Ruffini
Paolo Soda
Valerio Guarrasi

External Resources

View on arXiv (http://arxiv.org/abs/2506.00633v1)
