CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation
Journal: arXiv
Published Date: Mar 25, 2025
Abstract
The acquisition of annotated datasets with paired images and segmentation
masks is a critical challenge in domains such as medical imaging, remote
sensing, and computer vision. Manual annotation demands significant resources,
faces ethical constraints, and depends heavily on domain expertise. Existing
generative models often target single-modality outputs, either images or
segmentation masks, failing to address the need for high-quality, simultaneous
image-mask generation. Additionally, these models frequently lack adaptable
conditioning mechanisms, restricting control over the generated outputs and
limiting their applicability for dataset augmentation and rare scenario
simulation. We propose CoSimGen, a diffusion-based framework for controllable
simultaneous image and mask generation. Conditioning is intuitively achieved
through (1) text prompts grounded in class semantics, (2) spatial embedding of
context prompts to provide spatial coherence, and (3) spectral embedding of
timestep information to model noise levels during diffusion. To enhance
controllability and training efficiency, the framework incorporates a contrastive
triplet loss between text and class embeddings, alongside diffusion and
adversarial losses. Initial low-resolution outputs (128 x 128) are super-resolved
to 512 x 512, producing high-fidelity images and masks with strict adherence to
conditions. We evaluate the image fidelity and semantic alignment of generated
samples on 4 diverse datasets, using metrics such as FID, KID, LPIPS, Class FID,
and positive predictive value. CoSimGen achieves state-of-the-art performance
across all datasets, with the lowest KID of 0.11 and LPIPS of 0.53.
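
The contrastive triplet loss between text and class embeddings mentioned in the abstract can be illustrated with a minimal sketch. This is the standard margin-based triplet formulation, not necessarily the exact loss used in CoSimGen: the Euclidean distance, the margin value, and the function names and toy embeddings are all assumptions made for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss (assumed form, not from the paper).

    Pulls the anchor toward the positive and pushes it away from the
    negative until the distance gap is at least `margin`.
    """
    d_pos = l2_distance(anchor, positive)
    d_neg = l2_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings: a text prompt embedding as the anchor,
# the embedding of its matching class as the positive, and the
# embedding of a different class as the negative.
text_emb = [1.0, 0.0]
same_class_emb = [0.9, 0.1]
other_class_emb = [0.0, 1.0]

# Well-aligned triplet: the margin is already satisfied, so the loss is zero.
loss_good = triplet_loss(text_emb, same_class_emb, other_class_emb)

# Swapped triplet: the text sits closer to the wrong class, so the loss is positive.
loss_bad = triplet_loss(text_emb, other_class_emb, same_class_emb)
```

In a training setup of this kind, such a term would be added to the diffusion and adversarial losses so that text prompts and class embeddings of the same class end up close together in the shared embedding space.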