PaliGemma-CXR: A Multi-task Multimodal Model for TB Chest X-ray Interpretation
Journal: arXiv
Published Date: Feb 28, 2025
Abstract
Tuberculosis (TB) is an infectious disease and a global health challenge. Chest X-rays are a
standard method for TB screening, yet many countries face a critical shortage
of radiologists capable of interpreting these images. Machine learning offers
an alternative, as it can automate tasks such as disease diagnosis and report
generation. However, traditional approaches rely on task-specific models, which
cannot exploit the interdependence between tasks. Building a single multi-task
model poses additional challenges such as
scarcity of multimodal data, dataset imbalance, and negative transfer. To
address these challenges, we propose PaliGemma-CXR, a multi-task multimodal
model capable of performing TB diagnosis, object detection, segmentation,
report generation, and VQA. Starting with a dataset of chest X-ray images
annotated with TB diagnosis labels and segmentation masks, we curated a
multimodal dataset to support additional tasks. By finetuning PaliGemma on this
dataset and sampling each task in inverse proportion to the size of its
dataset, we achieved the following results across all tasks: 90.32% accuracy
on TB diagnosis, 98.95% accuracy on closed-ended VQA, a BLEU score of 41.3 on
report generation, and a mAP of 19.4 and 16.0 on object detection and segmentation,
respectively. These results demonstrate that PaliGemma-CXR effectively
leverages the interdependence between multiple image interpretation tasks to
enhance performance.
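
The inverse-size sampling mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes one plausible reading: each task is drawn with probability proportional to 1 / (number of examples in its dataset), so smaller task datasets are visited more often per example. The function names and the dataset sizes below are hypothetical and are not taken from the paper.

```python
import random


def inverse_size_weights(task_sizes):
    """Return per-task sampling probabilities proportional to 1 / dataset size.

    Assumption: this is one reading of "ratios of the inverse of the size of
    task datasets"; the paper's exact scheme may differ.
    """
    inverse = {task: 1.0 / size for task, size in task_sizes.items()}
    total = sum(inverse.values())
    # Normalize so the probabilities sum to 1.
    return {task: value / total for task, value in inverse.items()}


def sample_tasks(task_sizes, num_draws, seed=0):
    """Draw a sequence of task names according to the inverse-size weights."""
    rng = random.Random(seed)
    weights = inverse_size_weights(task_sizes)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=num_draws)


# Hypothetical dataset sizes, for illustration only.
task_sizes = {
    "diagnosis": 4000,
    "detection": 1000,
    "segmentation": 1000,
    "report_generation": 2000,
    "vqa": 8000,
}

weights = inverse_size_weights(task_sizes)
schedule = sample_tasks(task_sizes, num_draws=1000)
```

Under these assumed sizes, a task with 1,000 examples is drawn eight times as often as a task with 8,000 examples, which counteracts dataset imbalance at the cost of repeating examples from the smaller datasets more frequently.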