Preserve Anything: Controllable Image Synthesis with Object Preservation
Journal:
arXiv
Published Date:
Jun 27, 2025
Abstract
We introduce \textit{Preserve Anything}, a novel method for controlled image
synthesis that addresses key limitations in object preservation and semantic
consistency in text-to-image (T2I) generation. Existing approaches often fail
(i) to preserve multiple objects with fidelity, (ii) maintain semantic
alignment with prompts, or (iii) provide explicit control over scene
composition. To overcome these challenges, the proposed method employs an
N-channel ControlNet that integrates (i) object preservation with size and
placement agnosticism, color and detail retention, and artifact elimination,
(ii) high-resolution, semantically consistent backgrounds with accurate
shadows, lighting, and prompt adherence, and (iii) explicit user control over
background layouts and lighting conditions. Key components of our framework
include object preservation and background guidance modules, enforcing lighting
consistency and a high-frequency overlay module to retain fine details while
mitigating unwanted artifacts. We introduce a benchmark dataset consisting of
240K natural images filtered for aesthetic quality and 18K 3D-rendered
synthetic images with metadata such as lighting, camera angles, and object
relationships. This dataset addresses the deficiencies of existing benchmarks
and allows a complete evaluation. Empirical results demonstrate that our method
achieves state-of-the-art performance, significantly improving feature-space
fidelity (FID 15.26) and semantic alignment (CLIP-S 32.85) while maintaining
competitive aesthetic quality. We also conducted a user study to demonstrate
the efficacy of the proposed work on unseen benchmark and observed a remarkable
improvement of $\sim25\%$, $\sim19\%$, $\sim13\%$, and $\sim14\%$ in terms of
prompt alignment, photorealism, the presence of AI artifacts, and natural
aesthetics over existing works.