MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping
Journal: arXiv
Published Date: Jun 6, 2025
Abstract
Robotic manipulation of unseen objects via natural language commands remains
challenging. Language-driven robotic grasping (LDRG) predicts stable grasp
poses from natural language queries and RGB-D images. Here we introduce
mask-guided feature pooling, a lightweight enhancement to existing LDRG
methods. Our approach employs a two-stage training strategy: first, a
vision-language model generates feature maps from CLIP-fused embeddings, which
are upsampled and weighted by text embeddings to produce segmentation masks.
Next, the decoder generates separate feature maps for grasp prediction, pooling
only token features within these masked regions to efficiently predict grasp
poses. This targeted pooling approach reduces computational complexity,
accelerating both training and inference. Incorporating mask pooling results in
a 12% improvement over prior approaches on the OCID-VLG benchmark. Furthermore,
we introduce RefGraspNet, an open-source dataset eight times larger than
existing alternatives, significantly enhancing model generalization for
open-vocabulary grasping. By extending 2D grasp predictions to 3D via depth
mapping and inverse kinematics, our modular method achieves performance
comparable to recent Vision-Language-Action (VLA) models on the LIBERO
simulation benchmark, with improved generalization across different task
suites. Real-world experiments on a 7-DoF Franka robotic arm demonstrate a 57%
success rate with unseen objects, surpassing competitive baselines by 7%. Code
will be released after publication.
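
To make the pooling step concrete, the following is a minimal sketch of pooling only the token features that fall inside a predicted mask. It is not the authors' implementation: the function name `masked_mean_pool`, the 0.5 threshold, the use of a mean as the pooling operator, and the fallback for an empty mask are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def masked_mean_pool(
    features: Sequence[Sequence[float]],
    mask: Sequence[float],
    threshold: float = 0.5,
) -> List[float]:
    """Average the token features whose mask score exceeds `threshold`.

    features: N tokens, each a D-dimensional feature vector.
    mask: N segmentation scores in [0, 1], one per token.
    """
    if len(features) != len(mask):
        raise ValueError("features and mask must have the same length")
    if not features:
        return []
    dim = len(features[0])
    # Keep only tokens inside the masked region.
    selected = [f for f, m in zip(features, mask) if m > threshold]
    if not selected:
        # Assumption: if the mask is empty, fall back to pooling all tokens.
        selected = list(features)
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


# Four tokens with 2-D features; only tokens 1 and 2 lie inside the mask.
features = [[1.0, 0.0], [2.0, 4.0], [4.0, 8.0], [9.0, 9.0]]
mask = [0.1, 0.9, 0.8, 0.2]
pooled = masked_mean_pool(features, mask)
print(pooled)  # [3.0, 6.0]
```

Because the grasp head in this sketch sees one pooled vector rather than every token, the cost of the later prediction step no longer depends on image resolution, which is one plausible reading of the efficiency claim above.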
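
The abstract does not say how 2D grasp predictions are lifted to 3D beyond "depth mapping". One common way to do this is pinhole back-projection of the grasp center pixel using the depth value at that pixel, sketched below. The function name `pixel_to_camera_point` and the intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical values for illustration, not parameters from the paper.

```python
from typing import Tuple


def pixel_to_camera_point(
    u: float,
    v: float,
    depth_m: float,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) with depth `depth_m` (meters) into the
    camera frame using the standard pinhole model."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics: 500 px focal length, principal point (320, 240).
point = pixel_to_camera_point(
    u=420.0, v=240.0, depth_m=0.5, fx=500.0, fy=500.0, cx=320.0, cy=240.0
)
print(point)  # (0.1, 0.0, 0.5)
```

The resulting camera-frame point would still need a camera-to-robot transform before an inverse kinematics solver could turn it into joint targets. Those steps depend on the specific setup and are not shown here.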