Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings
Journal: arXiv
Published Date: Mar 25, 2025
Abstract
Advancements in computer-assisted surgical procedures heavily rely on
accurate visual data interpretation from camera systems used during surgeries.
Traditional open-access datasets focusing on surgical procedures are often
limited by their small size, typically consisting of fewer than 100 videos and
fewer than 100K images. To address these constraints, a new dataset called
Surg-3M has been compiled using a novel aggregation pipeline that collects
high-resolution videos from online sources. With over 4K surgical videos and
more than 3 million high-quality images from multiple procedure types, Surg-3M
surpasses existing alternatives in size and scope and includes two novel
tasks. To
demonstrate the effectiveness of this dataset, we present SurgFM, a
self-supervised foundation model pretrained on Surg-3M that performs strongly
on downstream tasks such as surgical phase recognition, action recognition,
and tool presence detection. SurgFM combines ConvNeXt, DINO, and a novel
augmented distillation method, and it compares favorably with specialist
architectures across various benchmarks. Our experimental results show that
SurgFM outperforms
state-of-the-art models in multiple downstream tasks, including significant
gains in surgical phase recognition (+8.9 pp, +4.7 pp, and +3.9 pp Jaccard on
AutoLaparo, M2CAI16, and Cholec80, respectively), action recognition (+3.1 pp
mAP on CholecT50), and tool presence detection (+4.6 pp mAP on Cholec80).
Moreover, even when using only half of the data, SurgFM outperforms
state-of-the-art models on AutoLaparo and achieves state-of-the-art
performance on Cholec80.
Both Surg-3M and SurgFM have significant potential to accelerate progress
towards developing autonomous robotic surgery systems.
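
The phase recognition gains above are reported in percentage points (pp) of Jaccard. As a minimal sketch of how such a gain can be computed, the snippet below assumes the common frame-wise definition: per-phase intersection over union of predicted and ground-truth frame labels, averaged over the phases present in the ground truth. This may differ from the exact evaluation protocol used for the reported numbers. The function name `mean_jaccard` and the toy label sequences are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_jaccard(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Frame-wise Jaccard index averaged over ground-truth phases, in percent."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    if len(y_true) == 0:
        raise ValueError("label sequences must not be empty")
    scores = []
    for phase in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        # Frames where both ground truth and prediction say `phase`.
        inter = sum(1 for t, p in pairs if t == phase and p == phase)
        # Frames where either ground truth or prediction says `phase`.
        union = sum(1 for t, p in pairs if t == phase or p == phase)
        scores.append(inter / union)  # union >= 1 because phase is in y_true
    return 100.0 * sum(scores) / len(scores)


# Hypothetical toy video: 8 frames, 3 phases (illustrative only).
truth = [0, 0, 0, 1, 1, 1, 2, 2]
baseline = [0, 0, 1, 1, 1, 2, 2, 2]
improved = [0, 0, 0, 1, 1, 2, 2, 2]

# A gain "in pp" is the plain difference between two percentage scores.
gain_pp = mean_jaccard(truth, improved) - mean_jaccard(truth, baseline)
print(f"gain: {gain_pp:+.1f} pp Jaccard")
```

Reporting the difference in percentage points, rather than as a relative percentage, keeps gains comparable across datasets whose baseline scores differ.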