RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics
Journal:
arXiv
Published Date:
Apr 2, 2025
Abstract
Visual Language Models (VLMs) have emerged as pivotal tools for robotic
systems, enabling cross-task generalization, dynamic environmental interaction,
and long-horizon planning through multimodal perception and semantic reasoning.
However, existing open-source VLMs, predominantly trained for generic
vision-language alignment tasks, fail to effectively model the temporally
correlated action semantics that are crucial for robotic manipulation. While current
image-based fine-tuning methods partially adapt VLMs to robotic applications,
they fundamentally disregard temporal evolution patterns in video sequences and
suffer from visual feature entanglement between robotic agents, manipulated
objects, and environmental contexts, thereby limiting semantic decoupling
capability for atomic actions and compromising model generalizability. To
overcome these challenges, this work presents RoboAct-CLIP with dual technical
contributions: 1) A dataset reconstruction framework that performs
semantic-constrained action unit segmentation and re-annotation on open-source
robotic videos, constructing purified training sets containing singular atomic
actions (e.g., "grasp"); 2) A temporal-decoupling fine-tuning strategy based on the
Contrastive Language-Image Pretraining (CLIP) architecture, which disentangles
temporal action features across video frames from object-centric
characteristics to achieve hierarchical representation learning of robotic
atomic actions. Experimental results in simulated environments demonstrate that
the RoboAct-CLIP pretrained model achieves a 12% higher success rate than
baseline VLMs, along with superior generalization in multi-object manipulation
tasks.
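
To make the idea of temporal decoupling concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes per-frame embeddings are already available from a CLIP-style image encoder, takes the temporal mean as a stand-in for object-centric features and the mean frame-to-frame difference as a stand-in for temporal action features, and aligns the action features with text embeddings using a symmetric CLIP-style contrastive loss. The function names (`split_features`, `clip_contrastive_loss`), the decomposition itself, and the temperature value are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length; zero vectors stay (approximately) zero."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def split_features(frame_emb):
    """Hypothetical decoupling of per-frame embeddings of shape (B, T, D).

    Returns (action, obj), each of shape (B, D):
      obj    - temporal mean, a rough proxy for time-invariant appearance
      action - mean frame-to-frame difference, a rough proxy for motion
               (equal to (last - first) / (T - 1))
    """
    obj = frame_emb.mean(axis=1)
    diffs = frame_emb[:, 1:] - frame_emb[:, :-1]  # (B, T-1, D)
    action = diffs.mean(axis=1)
    return l2_normalize(action), l2_normalize(obj)

def clip_contrastive_loss(video_feat, text_feat, temperature=0.07):
    """Symmetric cross-entropy over a (B, B) similarity matrix, CLIP-style.

    Row i of video_feat is assumed to be paired with row i of text_feat.
    """
    logits = video_feat @ text_feat.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

# Usage with random stand-in embeddings: 4 clips, 6 frames, 16 dimensions.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 6, 16))
action, obj = split_features(frames)
text = rng.normal(size=(4, 16))
loss = clip_contrastive_loss(action, l2_normalize(text))
```

In this sketch a clip whose frames are all identical yields a zero action feature, which is the intended behavior of a motion-only descriptor; a learned model would replace both hand-written proxies with trainable branches.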