Slot-BERT: Self-supervised object discovery in surgical video.
Journal:
Medical Image Analysis
Published Date:
Feb 3, 2026
Abstract
Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical video. However, current object-centric models either fail to reliably capture object dependencies in seconds-long video episodes that encompass surgical actions and tasks, or are too computationally expensive for practical implementation. We introduce Slot-BERT, a slot attention model with a temporal slot transformer module, to overcome these limitations. Our core contributions are: 1) a bidirectional transformer module that processes object-centric slot representations, enabling longer-range temporal coherence; 2) a slot-contrastive loss that further improves the representation by enforcing slot dissimilarity; and 3) an evaluation of Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures, and on real and synthetic videos with everyday objects. Under unsupervised training, our method outperforms state-of-the-art object-centric approaches across these domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.
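The abstract does not give the form of the slot-contrastive loss, so the following is only an illustrative sketch of what "enforcing slot dissimilarity" could look like, not the authors' formulation. It assumes each slot in a frame is a plain feature vector and penalizes the mean pairwise cosine similarity between distinct slots. The names `cosine` and `slot_dissimilarity_penalty` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero slots.
    return dot / (norm_u * norm_v + 1e-8)


def slot_dissimilarity_penalty(slots):
    """Mean pairwise cosine similarity over distinct slots in one frame.

    Hypothetical stand-in for a slot-contrastive term: the value is near 1
    when slots collapse onto the same direction and near 0 when they are
    mutually orthogonal, so minimizing it pushes slots apart.
    """
    k = len(slots)
    if k < 2:
        return 0.0  # A single slot has nothing to be contrasted with.
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            total += cosine(slots[i], slots[j])
    return total / (k * (k - 1) / 2)


# Two toy frames, each with two 2-dimensional slots (assumed shapes).
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
```

In this toy setup the penalty is close to 0 for `orthogonal` and close to 1 for `collapsed`. A real model would presumably compute such a term on batched tensors and combine it with a reconstruction objective, but those details are not stated in the abstract.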