Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
Journal:
arXiv
Published Date:
Jun 10, 2025
Abstract
Medical ultrasonography is an essential imaging technique for examining
superficial organs and tissues, including lymph nodes, breast, and thyroid. It
employs high-frequency ultrasound waves to generate detailed images of the
internal structures of the human body. However, manually contouring regions of
interest in these images is a labor-intensive task that demands expertise and
often results in inconsistent interpretations among individuals.
Vision-language foundation models, which have excelled in various computer
vision applications, present new opportunities for enhancing ultrasound image
analysis. Yet, their performance is hindered by the significant differences
between natural and medical imaging domains. This research seeks to overcome
these challenges by developing domain adaptation methods for vision-language
foundation models. In this study, we explore the fine-tuning pipeline for
vision-language foundation models by utilizing large language model as text
refiner with special-designed adaptation strategies and task-driven heads. Our
approach has been extensively evaluated on six ultrasound datasets and two
tasks: segmentation and classification. The experimental results show that our
method can effectively improve the performance of vision-language foundation
models for ultrasound image analysis, and outperform the existing
state-of-the-art vision-language and pure foundation models. The source code of
this study is available at
\href{https://github.com/jinggqu/NextGen-UIA}{GitHub}.