Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation
Journal: arXiv
Published Date: Jun 9, 2025
Abstract
How can we generate an image B' that satisfies A:A'::B:B', given the input
images A, A', and B? Recent works have tackled this challenge through approaches
like visual in-context learning or visual instruction. However, these methods
are typically limited to specific models (e.g., InstructPix2Pix, inpainting
models) rather than general diffusion models (e.g., Stable Diffusion, SDXL).
This dependency may lead to inherited biases or lower editing capabilities. In
this paper, we propose Difference Inversion, a method that isolates only the
difference from A and A' and applies it to B to generate a plausible B'. To
address model dependency, it is crucial to structure prompts in the form of a
"Full Prompt" suitable for input to Stable Diffusion models, rather than using
an "Instruction Prompt". To this end, we accurately extract the Difference
between A and A' and combine it with the prompt of B, enabling a plug-and-play
application of the difference. To extract a precise difference, we first
identify it through (1) Delta Interpolation. To ensure accurate training, we
additionally propose (2) a Token Consistency Loss and (3) Zero Initialization
of Token Embeddings. Our extensive experiments demonstrate that Difference
Inversion outperforms existing baselines both quantitatively and qualitatively,
indicating its ability to generate more feasible B' in a model-agnostic manner.
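
The analogy A:A'::B:B' can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify how Delta Interpolation or the Token Consistency Loss are computed, so the code below only shows the general idea of isolating a difference, starting a learnable token at zero, and attaching it to a prompt that describes B. The function names, the `<diff>` placeholder, and the use of plain vector arithmetic are all assumptions made for illustration.

```python
# Toy illustration of A:A'::B:B'. Short lists of floats stand in for embeddings.
# All names here are hypothetical; none come from the paper.

def isolate_difference(emb_a, emb_a_prime):
    """Return the element-wise difference A' - A."""
    return [ap - a for a, ap in zip(emb_a, emb_a_prime)]


def apply_difference(emb_b, delta, scale=1.0):
    """Shift B by the (optionally scaled) difference to approximate B'."""
    return [b + scale * d for b, d in zip(emb_b, delta)]


def init_difference_token(dim):
    """Zero-initialized token: before any training it encodes 'no change'."""
    return [0.0] * dim


def build_full_prompt(prompt_b, difference_token="<diff>"):
    """Append a placeholder token to a description of B.

    The result describes the target image (a "Full Prompt") instead of
    issuing an edit command (an "Instruction Prompt").
    """
    return f"{prompt_b} {difference_token}"


emb_a, emb_a_prime, emb_b = [1.0, 2.0], [1.5, 2.0], [0.0, 1.0]
delta = isolate_difference(emb_a, emb_a_prime)
emb_b_prime = apply_difference(emb_b, delta)
untrained = apply_difference(emb_b, init_difference_token(2))
full_prompt = build_full_prompt("a photo of a cat")
```

In this sketch a zero-initialized token leaves B unchanged, which is one intuitive reading of why such an initialization could help a learned token capture only the difference; the paper's own motivation may differ.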