Exploring the Design Space of 3D MLLMs for CT Report Generation
Journal:
arXiv
Published Date:
Jun 26, 2025
Abstract
Multimodal Large Language Models (MLLMs) have emerged as a promising way to
automate Radiology Report Generation (RRG). In this work, we systematically
investigate the design space of 3D MLLMs, including visual input
representation, projectors, Large Language Models (LLMs), and fine-tuning
techniques for 3D CT report generation. We also introduce two knowledge-based
report augmentation methods that improve performance on the GREEN score by up
to 10\%, achieving the 2nd place on the MICCAI 2024 AMOS-MM challenge. Our
results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely
independent of the size of LLM under the same training protocol. We also show
that larger volume size does not always improve performance if the original ViT
was pre-trained on a smaller volume size. Lastly, we show that using a
segmentation mask along with the CT volume improves performance. The code is
publicly available at https://github.com/bowang-lab/AMOS-MM-Solution