Advancing Chronic Tuberculosis Diagnostics Using Vision-Language Models: A Multimodal Framework for Precision Analysis
Journal: arXiv
Published Date: Mar 17, 2025
Abstract
Background: This study proposes a Vision-Language Model (VLM) that pairs the
SigLIP encoder with a Gemma-3b transformer decoder to enhance automated chronic
tuberculosis (TB) screening. By integrating chest X-ray images with clinical
data, the model addresses the challenges of manual interpretation, improving
diagnostic consistency and accessibility, particularly in resource-constrained
settings.
Methods: The VLM architecture combines a Vision Transformer (ViT) for visual
encoding and a transformer-based text encoder to process clinical context, such
as patient histories and treatment records. Cross-modal attention mechanisms
align radiographic features with textual information, while the Gemma-3b
decoder generates comprehensive diagnostic reports. The model was pre-trained
on 5 million paired medical images and texts and fine-tuned using 100,000
chronic TB-specific chest X-rays.
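
The cross-modal alignment step described above can be illustrated with a minimal sketch of scaled dot-product cross-attention, in which text tokens act as queries over image-patch features. This is not the authors' implementation: the function names, the dimensions, the random projection matrices, and the choice of text-as-query are all assumptions made for illustration, and a real model would use learned weights, multiple heads, and a tensor library.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_tokens, image_patches, w_q, w_k, w_v):
    """Single-head cross-attention: text queries attend over image patches.

    Assumption: text is the query side; the source does not specify this.
    """
    q = matmul(text_tokens, w_q)      # (n_text, d_attn)
    k = matmul(image_patches, w_k)    # (n_patch, d_attn)
    v = matmul(image_patches, w_v)    # (n_patch, d_attn)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)           # (n_text, n_patch)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned embeddings or projections."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
text_tokens = random_matrix(5, 8, rng)     # 5 text tokens, width 8 (illustrative)
image_patches = random_matrix(9, 12, rng)  # 9 image patches, width 12 (illustrative)
w_q = random_matrix(8, 16, rng)
w_k = random_matrix(12, 16, rng)
w_v = random_matrix(12, 16, rng)

fused, weights = cross_attention(text_tokens, image_patches, w_q, w_k, w_v)
```

Each row of `weights` is a probability distribution over image patches for one text token, so `fused` holds one image-informed vector per text token.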
Results: The model demonstrated high precision (94 percent) and recall (94
percent) for detecting key chronic TB pathologies, including fibrosis,
calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores
exceeded 0.93, and Intersection over Union (IoU) values were above 0.91,
validating its effectiveness in detecting and localizing TB-related
abnormalities.
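
For readers unfamiliar with the reported metrics, the following sketch shows how precision, recall, and IoU are commonly computed. The counts and box coordinates are made up for illustration and are not the study's data, and the use of axis-aligned bounding boxes is an assumption: the abstract does not say whether IoU was measured on boxes or on segmentation masks.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def box_area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0

# Illustrative counts only: 94 true positives, 6 false positives, 6 false negatives.
precision, recall = precision_recall(tp=94, fp=6, fn=6)

# Illustrative boxes only: a predicted region shifted halfway off the reference region.
reference_box = (0.0, 0.0, 10.0, 10.0)
predicted_box = (5.0, 0.0, 15.0, 10.0)
iou = box_iou(reference_box, predicted_box)
```

With these illustrative counts, precision and recall both come out to 0.94, and the half-overlapping boxes give an IoU of one third, which shows how much tighter an overlap an IoU above 0.91 implies.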
Conclusion: The VLM offers a robust and scalable solution for automated
chronic TB diagnosis, integrating radiographic and clinical data to deliver
actionable and context-aware insights. Future work will address subtle
pathologies and dataset biases to enhance the model's generalizability,
ensuring equitable performance across diverse populations and healthcare
settings.