EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy
Journal: arXiv
Published Date: May 21, 2025
Abstract
In endoscopic procedures, autonomously tracking abnormal regions and
following circumferential cutting markers can significantly reduce the
cognitive burden on endoscopists. However, conventional model-based pipelines
are fragile: each component (e.g., detection, motion planning) requires
manual tuning and struggles to incorporate high-level endoscopic intent,
leading to poor generalization across diverse scenes. Vision-Language-Action
(VLA) models, which integrate visual perception, language grounding, and motion
planning within an end-to-end framework, offer a promising alternative by
semantically adapting to surgeon prompts without manual recalibration. Despite
their potential, applying VLA models to robotic endoscopy presents unique
challenges due to the complex and dynamic anatomical environments of the
gastrointestinal (GI) tract. To address these challenges, we introduce EndoVLA,
a VLA model designed specifically for continuum robots in GI interventions. Given endoscopic images
and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1)
polyp tracking, (2) delineation and following of abnormal mucosal regions, and
(3) adherence to circular markers during circumferential cutting. To tackle
data scarcity and domain shifts, we propose a dual-phase strategy comprising
supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement
fine-tuning with task-aware rewards. Our approach significantly improves
tracking performance in endoscopy and enables zero-shot generalization to
diverse scenes and complex sequential tasks.