Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice
Journal: arXiv
Published Date: Dec 14, 2024
Abstract
In recent years, as robotics has advanced, human-robot collaboration has
gained increasing importance. However, current robots struggle to fully and
accurately interpret human intentions from voice commands alone. Traditional
gripper and suction systems often fail to interact naturally with humans, lack
advanced manipulation capabilities, and are not adaptable to diverse tasks,
especially in unstructured environments. This paper introduces the Embodied
Dexterous Grasping System (EDGS), designed to tackle object grasping in
cluttered environments for human-robot interaction. We propose a novel approach
to semantic-object alignment using a Vision-Language Model (VLM) that fuses
voice commands and visual information, significantly enhancing the alignment of
multi-dimensional attributes of target objects in complex scenarios. Inspired
by human hand-object interactions, we develop a robust, precise, and efficient
grasping strategy that incorporates principles such as the thumb-object axis,
multi-finger wrapping, and fingertip interaction with an object's contact
mechanics. We also design experiments to assess Referring Expression
Representation Enrichment (RERE) in referring expression segmentation,
demonstrating that our system accurately detects and matches referring
expressions. Extensive experiments confirm that EDGS handles complex grasping
tasks stably and with high success rates, highlighting its potential for
further development in the field of Embodied AI.