Spatio-Temporal Representation Decoupling and Enhancement for Federated Instrument Segmentation in Surgical Videos
Journal:
arXiv
Published Date:
Jun 30, 2025
Abstract
Surgical instrument segmentation under Federated Learning (FL) is a promising
direction, which enables multiple surgical sites to collaboratively train the
model without centralizing datasets. However, very few FL works exist
in surgical data science, and FL methods for other modalities do not consider
inherent characteristics of the surgical domain: i) different scenarios show
diverse anatomical backgrounds but highly similar instrument representations;
ii) surgical simulators exist that enable large-scale synthetic data
generation with minimal effort. In this paper, we propose a novel Personalized
FL scheme, Spatio-Temporal Representation Decoupling and Enhancement (FedST),
which wisely leverages surgical domain knowledge during both local-site and
global-server training to boost segmentation. Concretely, our model adopts a
Representation Separation and Cooperation (RSC) mechanism in local-site
training, which decouples the query embedding layer and trains it privately to
encode each site's background. Meanwhile, other parameters are optimized
globally to capture the consistent representations of instruments, including
the temporal layer to capture similar motion patterns. A textual-guided channel
selection is further designed to highlight site-specific features, facilitating
model adaptation to each site. Moreover, in global-server training, we propose
Synthesis-based Explicit Representation Quantification (SERQ), which defines an
explicit representation target based on synthetic data to synchronize model
convergence during fusion and improve generalization.
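
The private/shared parameter split described for RSC can be illustrated with a minimal sketch. It assumes that each site's parameters are stored as a name-to-values mapping and that the private query embedding parameters share a name prefix. The prefix `query_embed`, the helper names `aggregate_shared` and `load_shared`, and the toy values are all hypothetical. The fusion step here is a plain sample-weighted average in the style of FedAvg, used only as a stand-in; it is not the SERQ procedure, whose details the abstract does not give.

```python
from typing import Dict, List

State = Dict[str, List[float]]  # parameter name -> flat list of values


def aggregate_shared(site_states: List[State],
                     site_weights: List[float],
                     private_prefix: str = "query_embed") -> State:
    """Weighted-average every parameter except those marked private.

    Assumption: private parameters are identified by a name prefix.
    """
    total = sum(site_weights)
    shared_names = [n for n in site_states[0]
                    if not n.startswith(private_prefix)]
    global_state: State = {}
    for name in shared_names:
        size = len(site_states[0][name])
        global_state[name] = [
            sum(w * s[name][i] for s, w in zip(site_states, site_weights)) / total
            for i in range(size)
        ]
    return global_state


def load_shared(local_state: State, global_state: State) -> State:
    """Overwrite shared parameters with global values; keep private ones."""
    merged = dict(local_state)
    merged.update(global_state)
    return merged


# Two toy sites; weights stand in for local dataset sizes (hypothetical).
site_a = {"query_embed.weight": [1.0, 2.0], "temporal.weight": [0.0, 4.0]}
site_b = {"query_embed.weight": [5.0, 6.0], "temporal.weight": [2.0, 0.0]}

global_state = aggregate_shared([site_a, site_b], [1.0, 3.0])
new_a = load_shared(site_a, global_state)
```

In this sketch the server never sees an averaged query embedding: each site keeps its own copy across rounds, while the temporal layer and all other shared parameters are replaced by the fused values.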