Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation
Journal: arXiv
Published Date: Jun 13, 2025
Abstract
Self-supervised learning (SSL) has achieved major advances in natural image
and video understanding, but challenges remain in domains like echocardiography
(heart ultrasound) due to subtle anatomical structures, complex temporal
dynamics, and the current lack of domain-specific pre-trained models. Existing
SSL approaches, such as contrastive, masked-modeling, and clustering-based
methods, struggle with high inter-sample similarity, are sensitive to the
low-PSNR inputs common in ultrasound, or rely on aggressive augmentations that
distort clinically relevant features. We present DISCOVR (Distilled Image Supervision
for Cross Modal Video Representation), a self-supervised dual-branch framework
for cardiac ultrasound video representation learning. DISCOVR combines a
clustering-based video encoder that models temporal dynamics with an online
image encoder that extracts fine-grained spatial semantics. These branches are
connected through a semantic cluster distillation loss that transfers
anatomical knowledge from the evolving image encoder to the video encoder,
enabling temporally coherent representations enriched with fine-grained
semantic understanding. Evaluated on six echocardiography datasets spanning
fetal, pediatric, and adult populations, DISCOVR outperforms both specialized
video anomaly detection methods and state-of-the-art video-SSL baselines in
zero-shot and linear probing setups, and achieves superior segmentation
transfer.
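
The cluster distillation idea described above can be illustrated with a minimal sketch. The abstract does not give the loss formula, so everything here is an assumption made for illustration: the cross-entropy form, the shared `prototypes`, the two temperatures, and the function name `cluster_distillation_loss` are hypothetical and should not be read as the paper's exact formulation. The image branch acts as a teacher whose soft cluster assignment is the target, and the video branch is the student that is pushed to match it.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cluster_distillation_loss(image_feat, video_feat, prototypes,
                              teacher_temp=0.05, student_temp=0.1):
    """Cross-entropy between teacher and student cluster assignments.

    Assumption: both branches score their features against the same set
    of cluster prototypes; the paper may use a different design.
    """
    teacher_logits = [dot(image_feat, p) for p in prototypes]
    student_logits = [dot(video_feat, p) for p in prototypes]
    # Teacher target; in a real training loop it would receive no gradient.
    target = softmax(teacher_logits, teacher_temp)
    pred = softmax(student_logits, student_temp)
    return -sum(t * math.log(q + 1e-12) for t, q in zip(target, pred))


# Toy example with two clusters in a 2-D feature space (hypothetical values).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
aligned = cluster_distillation_loss([1.0, 0.0], [0.9, 0.1], prototypes)
misaligned = cluster_distillation_loss([1.0, 0.0], [0.1, 0.9], prototypes)
```

In this toy setup the loss is smaller when the video feature falls in the same cluster as the image feature (`aligned`) than when it falls in the other cluster (`misaligned`), which is the behavior a distillation term of this kind is meant to encourage.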