HOI-PAGE: Zero-Shot Human-Object Interaction Generation with Part Affordance Guidance
Journal:
arXiv
Published Date:
Jun 8, 2025
Abstract
We present HOI-PAGE, a new approach to synthesizing 4D human-object
interactions (HOIs) from text prompts in a zero-shot fashion, driven by
part-level affordance reasoning. In contrast to prior works that focus on
global, whole body-object motion for 4D HOI synthesis, we observe that
generating realistic and diverse HOIs requires a finer-grained understanding --
at the level of how human body parts engage with object parts. We thus
introduce Part Affordance Graphs (PAGs), a structured HOI representation
distilled from large language models (LLMs) that encodes fine-grained part
information along with contact relations. We then use these PAGs to guide a
three-stage synthesis: first, decomposing input 3D objects into geometric
parts; then, generating reference HOI videos from text prompts, from which we
extract part-based motion constraints; finally, optimizing for 4D HOI motion
sequences that not only mimic the reference dynamics but also satisfy
part-level contact constraints. Extensive experiments show that our approach is
flexible and capable of generating complex multi-object or multi-person
interaction sequences, with significantly improved realism and text alignment
for zero-shot 4D HOI generation.