fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models
Journal: arXiv
Published Date: Mar 25, 2025
Abstract
While vision-language models like CLIP have advanced zero-shot surgical phase
recognition, they struggle with fine-grained surgical activities, especially
action triplets of the form (instrument, verb, target). This limitation arises
because current CLIP formulations rely
on global image features, which overlook the fine-grained semantics and
contextual details crucial for complex tasks like zero-shot triplet
recognition. Furthermore, these models do not exploit the hierarchical
structure inherent in triplets, reducing their ability to generalize to novel
triplets. To address these challenges, we propose fine-CLIP, which learns
object-centric features and leverages the hierarchy in triplet formulation. Our
approach integrates three components: hierarchical prompt modeling to capture
shared semantics, LoRA-based vision backbone adaptation for enhanced feature
extraction, and a graph-based condensation strategy that groups similar patch
features into meaningful object clusters. Since triplet classification is a
challenging task, we introduce an alternative yet meaningful base-to-novel
generalization benchmark with two settings on the CholecT50 dataset:
Unseen-Target, assessing adaptability to triplets with novel anatomical
structures, and Unseen-Instrument-Verb, where models need to generalize to
novel instrument-verb interactions. fine-CLIP shows significant improvements in
F1 and mAP, enhancing zero-shot recognition of novel surgical triplets.
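
To make the two benchmark settings concrete, the following is a minimal sketch of how a base-to-novel split over triplet classes could be constructed. It is an illustration under stated assumptions, not the authors' protocol: the `Triplet` type, the two split functions, the example triplets, and the held-out sets are all hypothetical, and the actual CholecT50 class list and split are not reproduced here.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """Hypothetical container for one (instrument, verb, target) class."""
    instrument: str
    verb: str
    target: str


def split_unseen_target(triplets, novel_targets):
    """Unseen-Target style split: a triplet is novel if its target is held out."""
    base = [t for t in triplets if t.target not in novel_targets]
    novel = [t for t in triplets if t.target in novel_targets]
    return base, novel


def split_unseen_instrument_verb(triplets, novel_pairs):
    """Unseen-Instrument-Verb style split: a triplet is novel if its
    (instrument, verb) pair is held out."""
    base = [t for t in triplets if (t.instrument, t.verb) not in novel_pairs]
    novel = [t for t in triplets if (t.instrument, t.verb) in novel_pairs]
    return base, novel


# Illustrative triplets only; NOT the benchmark's actual classes or split.
TRIPLETS = [
    Triplet("grasper", "retract", "gallbladder"),
    Triplet("grasper", "retract", "liver"),
    Triplet("hook", "dissect", "gallbladder"),
    Triplet("clipper", "clip", "cystic_duct"),
    Triplet("scissors", "cut", "cystic_duct"),
]

# Hold out one target (assumed choice) for the Unseen-Target setting.
base_t, novel_t = split_unseen_target(TRIPLETS, {"cystic_duct"})

# Hold out one instrument-verb pair (assumed choice) for the other setting.
base_iv, novel_iv = split_unseen_instrument_verb(TRIPLETS, {("hook", "dissect")})
```

In this sketch a model would be adapted on the base triplets only and evaluated on the novel ones, so the novel set probes generalization along exactly one axis of the triplet: the target in the first setting, the instrument-verb interaction in the second.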