Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation
Journal: arXiv
Published Date: Mar 26, 2025
Abstract
Diffusion models have demonstrated strong performance in synthesizing
diverse and high-quality images for text-guided image translation. However,
there remains room for improvement in both the formulation of text prompts and
the preservation of reference image content. First, variations in target text
prompts can significantly influence the quality of the generated images, and it
is often challenging for users to craft an optimal prompt that fully captures
the content of the input image. Second, while existing models can introduce
desired modifications to specific regions of the reference image, they
frequently induce unintended alterations in areas that should remain unchanged.
To address these challenges, we propose pix2pix-zeroCon, a zero-shot
diffusion-based method that eliminates the need for additional training by
leveraging patch-wise contrastive loss. Specifically, we automatically
determine the editing direction in the text embedding space based on the
reference image and target prompts. Furthermore, to ensure precise content and
structural preservation in the edited image, we introduce a cross-attention
guiding loss and a patch-wise contrastive loss between the generated and original
image embeddings within a pre-trained diffusion model. Notably, our approach
requires no additional training and operates directly on a pre-trained
text-to-image diffusion model. Extensive experiments demonstrate that our
method surpasses existing models in image-to-image translation, achieving
enhanced fidelity and controllability.
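
To make the patch-wise contrastive loss more concrete, the following is a minimal sketch of a generic InfoNCE-style patch loss. It is not the paper's implementation. It assumes each patch is represented by a feature vector, that the generated and reference patches at the same spatial index form the positive pair, and that all other reference patches act as negatives. The function name `patch_nce_loss`, the helper `_normalize`, and the temperature value 0.07 are illustrative assumptions that the abstract does not specify.

```python
import math


def _normalize(vec):
    """Scale a feature vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def patch_nce_loss(gen_patches, ref_patches, temperature=0.07):
    """Mean InfoNCE loss over patches (illustrative, not the paper's code).

    gen_patches[i] and ref_patches[i] are feature vectors taken at the same
    spatial location and form the positive pair; every other reference
    patch serves as a negative for query i.
    """
    if not gen_patches or len(gen_patches) != len(ref_patches):
        raise ValueError("need two non-empty patch lists of equal length")

    queries = [_normalize(v) for v in gen_patches]
    keys = [_normalize(v) for v in ref_patches]

    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity of query i to every reference patch, scaled
        # by the temperature.
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the positive (same-index) patch as the target.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy features: three orthogonal "patches".
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [list(p) for p in reference]                 # structure preserved
shuffled = [reference[1], reference[2], reference[0]]  # structure scrambled

loss_aligned = patch_nce_loss(aligned, reference)
loss_shuffled = patch_nce_loss(shuffled, reference)
```

In this toy setup the loss is close to zero when every generated patch matches the reference patch at the same location, and it grows large when the patches are permuted. A loss with this behavior penalizes edits that disturb regions that should remain unchanged.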