InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing
Journal: arXiv
Published Date: May 30, 2025
Abstract
3D human-aware generation has made significant progress in recent years.
However, existing methods still struggle to generate novel Human-Object
Interactions (HOIs) from text, particularly for open-set objects. We identify
three main challenges of this task: precise human-object relation reasoning,
affordance parsing for any object, and detailed human interaction pose
synthesis that is consistent with both the text description and the object
geometry. In this work, we propose a
novel zero-shot 3D HOI generation framework without training on specific
datasets, leveraging the knowledge from large-scale pre-trained models.
Specifically, the human-object relations are inferred from large language
models (LLMs) to initialize object properties and guide the optimization
process. We then use a pre-trained 2D image diffusion model to parse unseen
objects and extract contact points, avoiding the limitations imposed by
existing 3D asset knowledge. The initial human pose is generated by sampling
multiple hypotheses through multi-view SDS based on the input text and object
geometry. Finally, we introduce a detailed optimization to generate
fine-grained, precise, and natural interactions, enforcing realistic 3D contact
between the 3D object and the involved body parts, including the hands during
grasping.
This is achieved by distilling human-level feedback from LLMs to capture
detailed human-object relations from the text instruction. Extensive
experiments validate the effectiveness of our approach compared to prior works,
particularly in terms of the fine-grained nature of interactions and the
ability to handle open-set 3D objects.
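
The abstract does not specify how 3D contact between the object and the involved body parts is enforced. As a minimal sketch of one common way to express such a constraint, the snippet below computes a nearest-point contact penalty: for each contact point extracted on the object, it measures the distance to the closest vertex of the relevant body part and averages the result. The function name contact_loss, its arguments, and the toy coordinates are assumptions made for illustration only and are not taken from the paper.

```python
import math


def contact_loss(body_part_vertices, object_contact_points):
    """Mean distance from each object contact point to its nearest body-part vertex.

    Hypothetical illustration of a contact penalty; not the paper's actual loss.
    Both arguments are sequences of (x, y, z) tuples.
    """
    if not body_part_vertices or not object_contact_points:
        raise ValueError("both point sets must be non-empty")
    nearest = [
        # Distance from this contact point to the closest body-part vertex.
        min(math.dist(point, vertex) for vertex in body_part_vertices)
        for point in object_contact_points
    ]
    return sum(nearest) / len(nearest)


# Toy example (assumed data): two hand vertices and two object contact points.
hand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
contacts = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
loss = contact_loss(hand, contacts)  # nearest distances are 1.0 and 0.0, mean 0.5
```

A penalty of this form is zero only when every contact point coincides with some body-part vertex, so minimizing it with respect to the pose parameters would pull the body part toward the parsed contact region. In practice such a term would typically be combined with penetration and pose-prior terms, which this sketch omits.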