SynergyAmodal: Deocclude Anything with Text Control
Journal:
arXiv
Published Date:
Apr 28, 2025
Abstract
Image deocclusion (or amodal completion) aims to recover the invisible
regions (\ie, shape and appearance) of occluded instances in images. Despite
recent advances, the scarcity of high-quality data that balances diversity,
plausibility, and fidelity remains a major obstacle. To address this challenge,
we identify three critical elements: leveraging in-the-wild image data for
diversity, incorporating human expertise for plausibility, and utilizing
generative priors for fidelity. We propose SynergyAmodal, a novel framework for
co-synthesizing in-the-wild amodal datasets with comprehensive shape and
appearance annotations, which integrates these elements through a tripartite
data-human-model collaboration. First, we design an occlusion-grounded
self-supervised learning algorithm to harness the diversity of in-the-wild
image data, fine-tuning an inpainting diffusion model into a partial completion
diffusion model. Second, we establish a co-synthesis pipeline to iteratively
filter, refine, select, and annotate the initial deocclusion results of the
partial completion diffusion model, ensuring plausibility and fidelity through
human expert guidance and prior model constraints. This pipeline generates a
high-quality paired amodal dataset with extensive category and scale diversity,
comprising approximately 16K pairs. Finally, we train a full completion
diffusion model on the synthesized dataset, incorporating text prompts as
conditioning signals. Extensive experiments demonstrate the effectiveness of
our framework in achieving zero-shot generalization and textual
controllability. Our code, dataset, and models will be made publicly available
at https://github.com/imlixinyang/SynergyAmodal.