A Global Visual Information Intervention Model for Medical Visual Question Answering.
Journal:
Computers in Biology and Medicine
Published Date:
Apr 28, 2025
Abstract
Medical Visual Question Answering (Med-VQA) aims to provide accurate answers to clinical questions about medical images. Despite its clear potential in healthcare, current solutions remain immature and have yet to see widespread clinical adoption. Med-VQA is more challenging than standard visual question answering (VQA) because of the wide variety of clinical scenarios and the scarcity of labeled medical images, which often lead to language bias and overfitting. To address these challenges, this study introduces Global Visual Information Intervention (GVII), a Med-VQA model designed to mitigate language bias and improve generalizability. GVII is built around two branches: the Global Visual Information Branch (GVIB), which extracts and filters holistic visual information to strengthen the image's contribution and reduce the dominance of the question, and the Forward Compensation Branch (FCB), which refines multimodal features to counterbalance the disturbance introduced by GVIB. The two branches work in tandem to improve predictive accuracy and robustness, and a multi-branch fusion mechanism integrates the features and losses across the branches. Experimental results show that the proposed model outperforms existing state-of-the-art models, improving accuracy on the PathVQA dataset by 2.6%. In conclusion, the GVII-based Med-VQA model mitigates prevalent language bias and overfitting while significantly improving diagnostic precision, marking a substantial step toward robust, clinically applicable VQA systems.
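The abstract does not give implementation details, so the following is only a minimal PyTorch-style sketch of the two-branch idea it describes. The class name GVIIHead, the sigmoid gate used to "filter" global visual features, the layer sizes, and the learnable fusion weight are illustrative assumptions, not the paper's actual GVIB/FCB design or loss formulation.

```python
# Minimal sketch of a two-branch head with multi-branch fusion (assumed design).
import torch
import torch.nn as nn


class GVIIHead(nn.Module):
    """Hypothetical head: a global-visual branch plus a compensation branch."""

    def __init__(self, img_dim=768, txt_dim=768, hidden=512, num_answers=500):
        super().__init__()
        # Global Visual Information Branch (GVIB): gate holistic image features
        # so the image contributes more to the prediction than the question.
        self.visual_gate = nn.Sequential(nn.Linear(img_dim, img_dim), nn.Sigmoid())
        self.gvib_cls = nn.Linear(img_dim, num_answers)

        # Forward Compensation Branch (FCB): refine fused multimodal features to
        # offset the perturbation introduced by the visual intervention.
        self.multimodal = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.fcb_cls = nn.Linear(hidden, num_answers)

        # Learnable weight for fusing the two branch predictions.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, img_feat, txt_feat):
        # GVIB: gated global visual features -> answer logits.
        gated_img = img_feat * self.visual_gate(img_feat)
        gvib_logits = self.gvib_cls(gated_img)

        # FCB: refined joint image-question features -> answer logits.
        joint = self.multimodal(torch.cat([gated_img, txt_feat], dim=-1))
        fcb_logits = self.fcb_cls(joint)

        # Multi-branch fusion of logits; per-branch losses could be combined analogously.
        fused = self.alpha * gvib_logits + (1.0 - self.alpha) * fcb_logits
        return fused, gvib_logits, fcb_logits


if __name__ == "__main__":
    head = GVIIHead()
    img = torch.randn(4, 768)   # pooled image features (e.g., from a vision encoder)
    txt = torch.randn(4, 768)   # pooled question features (e.g., from a text encoder)
    fused, gvib_logits, fcb_logits = head(img, txt)
    print(fused.shape)          # torch.Size([4, 500])
```

In this sketch, each branch produces its own answer logits so that branch-specific losses could be supervised separately and then fused, which is one plausible reading of the "multi-branch fusion mechanism" mentioned in the abstract.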