GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation
Journal: arXiv
Published Date: May 17, 2025
Abstract
Learning manipulation skills from human demonstration videos offers a
promising path toward generalizable and interpretable robotic
intelligence, particularly through the lens of actionable affordances. However,
transferring such knowledge remains challenging due to: 1) a lack of
large-scale datasets with precise affordance annotations, and 2) insufficient
exploration of affordances in diverse manipulation contexts. To address these
gaps, we introduce HOVA-500K, a large-scale, affordance-annotated dataset
comprising 500,000 images across 1,726 object categories and 675 actions. We
also release a standardized benchmarking suite for multi-modal affordance
reasoning. Built upon HOVA-500K, we present GLOVER++, a global-to-local
affordance training framework that effectively transfers actionable affordance
knowledge from human demonstrations to downstream open-vocabulary reasoning
tasks. GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark
and demonstrates strong generalization across diverse downstream robotic
manipulation tasks. By explicitly modeling actionable affordances, GLOVER++
facilitates robust transfer across scenes, modalities, and tasks. We hope that
HOVA-500K and the GLOVER++ framework will serve as valuable resources for
bridging the gap between human demonstrations and robotic manipulation
capabilities.