Ultrasound Report Generation with Multimodal Large Language Models for Standardized Texts
Journal:
arXiv
Published Date:
May 13, 2025
Abstract
Ultrasound (US) report generation is a challenging task due to the
variability of US images, operator dependence, and the need for standardized
text. Unlike X-ray and CT, US imaging lacks consistent datasets, making
automation difficult. In this study, we propose a unified framework for
multi-organ and multilingual US report generation, integrating fragment-based
multilingual training and leveraging the standardized nature of US reports. By
aligning modular text fragments with diverse imaging data and curating a
bilingual English-Chinese dataset, the method achieves consistent and
clinically accurate text generation across organ sites and languages.
Fine-tuning with selective unfreezing of the vision transformer (ViT) further
improves text-image alignment. Compared to the previous state-of-the-art KMVE
method, our approach achieves relative gains of about 2\% in BLEU scores,
approximately 3\% in ROUGE-L, and about 15\% in CIDEr, while significantly
reducing errors such as missing or incorrect content. By unifying multi-organ
and multi-language report generation into a single, scalable framework, this
work demonstrates strong potential for real-world clinical workflows.