Video CLIP Model for Multi-View Echocardiography Interpretation
Journal: arXiv
Published Date: Apr 26, 2025
Abstract
Echocardiography involves recording videos of the heart using ultrasound,
enabling clinicians to evaluate its condition. Recent advances in large-scale
vision-language models (VLMs) have garnered attention for automating the
interpretation of echocardiographic videos. However, most VLMs proposed to
date for medical interpretation rely on single-frame (i.e., image)
inputs. Consequently, these image-based models often exhibit lower diagnostic
accuracy for conditions identifiable through cardiac motion. Moreover,
echocardiographic videos are recorded from various views that depend on the
direction of ultrasound emission, and certain views are more suitable than
others for interpreting specific conditions. Incorporating multiple views could
therefore yield further improvements in accuracy. In this study, we developed
a video-language model that takes five different views and full video sequences
as input, training it on pairs of echocardiographic videos and clinical reports
from 60,747 cases. Our experiments demonstrate that this expanded approach
achieves higher interpretation accuracy than models trained with only
single-view videos or with still images.
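
To make the training setup more concrete, the following is a minimal sketch of a CLIP-style objective over multi-view video inputs. It is not the authors' implementation: the abstract does not say how frames or views are fused, so mean pooling over frames and then over views is an assumption made here for illustration, as are the function names (`study_embedding`, `clip_loss`), the temperature value, and the use of precomputed per-frame feature vectors in place of a real video encoder.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def study_embedding(views):
    """Fuse one study into a single unit-length embedding.

    `views` is a list of views; each view is a list of per-frame
    feature vectors. Mean pooling is an assumed fusion strategy.
    """
    per_view = [mean_pool(frames) for frames in views]  # pool over time
    return l2_normalize(mean_pool(per_view))            # pool over views

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over matched (video, report) pairs."""
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in text_embs]
        for v in video_embs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # report-to-video direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

# Toy batch: two studies, five views each, two frames per view, 2-D features.
study_a = [[[1.0, 0.0], [0.9, 0.1]] for _ in range(5)]
study_b = [[[0.0, 1.0], [0.1, 0.9]] for _ in range(5)]
video_embs = [study_embedding(study_a), study_embedding(study_b)]
report_embs = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for encoded reports

matched = clip_loss(video_embs, report_embs)
swapped = clip_loss(video_embs, report_embs[::-1])
```

In this toy batch the correctly paired reports give a much smaller loss than the swapped ones, which is the signal that pulls each study's fused video embedding toward its own report during training.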