Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings
Journal: arXiv
Published Date: Feb 1, 2025
Abstract
Vision-language models can connect the text description of an object to its
specific location in an image through visual grounding. This has potential
applications in enhanced radiology reporting. However, these models require
large annotated image-text datasets, which are lacking for PET/CT. We developed
an automated pipeline to generate weak labels linking PET/CT report
descriptions to their image locations and used it to train a 3D vision-language
visual grounding model. Our pipeline finds positive findings in PET/CT reports
by identifying mentions of SUVmax and axial slice numbers. From 25,578 PET/CT
exams, we extracted 11,356 sentence-label pairs. Using this data, we trained
ConTEXTual Net 3D, which integrates text embeddings from a large language model
with a 3D nnU-Net via token-level cross-attention. The model's performance was
compared against LLMSeg, against a 2.5D version of ConTEXTual Net, and against
two nuclear medicine physicians. The weak-labeling pipeline accurately identified lesion
locations in 98% of cases (246/251), with 7.5% requiring boundary adjustments.
ConTEXTual Net 3D achieved an F1 score of 0.80, outperforming LLMSeg (F1=0.22)
and the 2.5D model (F1=0.53), though it underperformed both physicians (F1=0.94
and 0.91). Performance was higher on FDG (F1=0.78) and DCFPyL (F1=0.75) exams
than on DOTATATE (F1=0.58) and Fluciclovine (F1=0.66) exams. The model performed
consistently across lesion sizes but showed
reduced accuracy on lesions with low uptake. Our novel weak-labeling pipeline
accurately produced an annotated dataset of PET/CT image-text pairs,
facilitating the development of 3D visual grounding models. ConTEXTual Net 3D
significantly outperformed other models but fell short of the performance of
nuclear medicine physicians. Our study suggests that even larger datasets may
be needed to close this performance gap.
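
As an illustration of the report-mining step described above, the following is a minimal sketch of how sentences that mention both an SUVmax value and an axial slice number might be pulled from report text. The regular expressions, the `Finding` type, and the `extract_findings` function are hypothetical and are not taken from the authors' pipeline; real report phrasing varies, and the step that converts a matched sentence into a 3D label is not shown.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    """One candidate positive finding (hypothetical structure)."""
    sentence: str
    suv_max: float
    slice_number: int


# Assumed phrasings only, e.g. "SUVmax 7.2", "SUV max of 7.2", "SUVmax: 7.2".
SUV_RE = re.compile(r"SUV\s*max\s*(?:of|=|:)?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)
# Assumed phrasings only, e.g. "axial slice 112", "image 45", "slice #30".
SLICE_RE = re.compile(
    r"(?:axial\s+)?(?:slice|image)\s*(?:number|#)?\s*(\d+)", re.IGNORECASE
)


def extract_findings(report: str) -> List[Finding]:
    """Return sentences that mention both an SUVmax value and a slice number."""
    # Naive sentence split on whitespace that follows ., ! or ?.
    sentences = re.split(r"(?<=[.!?])\s+", report.strip())
    findings = []
    for sentence in sentences:
        suv = SUV_RE.search(sentence)
        slc = SLICE_RE.search(sentence)
        # Keep the sentence only when both cues are present.
        if suv and slc:
            findings.append(
                Finding(sentence, float(suv.group(1)), int(slc.group(1)))
            )
    return findings
```

Requiring both cues in the same sentence keeps the sketch conservative: a sentence that reports an SUVmax without a slice reference cannot be localized, so it is skipped instead of being assigned a guessed location.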