UnICLAM: Contrastive representation learning with adversarial masking for unified and interpretable Medical Vision Question Answering.

Journal: Medical Image Analysis
Published Date:

Abstract

Medical Visual Question Answering (Medical-VQA) aims to assist doctors in decision-making by answering clinical questions about radiology images. Nevertheless, current models learn cross-modal representations with the vision and text encoders residing in two separate spaces, which inevitably leads to indirect semantic alignment. In this paper, we propose UnICLAM, a Unified and Interpretable Medical-VQA model based on Contrastive Representation Learning with Adversarial Masking. To learn an aligned image-text representation, we first establish a unified dual-stream pre-training structure with a gradually soft parameter-sharing strategy for alignment. Specifically, the strategy constrains the vision and text encoders to stay close in a shared space, and the constraint is gradually loosened as the layer depth increases, narrowing the distance between the two modalities. To grasp a unified semantic cross-modal representation, we extend adversarial masking data augmentation to the contrastive representation learning of vision and text in a unified manner. While encoder training minimizes the distance between the original and masked samples, the adversarial masking module is trained adversarially to maximize that distance. We also further explore the unified adversarial masking augmentation, which improves potential ante-hoc interpretability while retaining remarkable performance and efficiency. Experimental results on the VQA-RAD and SLAKE benchmarks demonstrate that UnICLAM outperforms 11 existing state-of-the-art Medical-VQA methods. More importantly, we additionally discuss the performance of UnICLAM in diagnosing heart failure, verifying that it exhibits superior few-shot adaptation performance in practical disease diagnosis. The code and models will be released upon acceptance of the paper.
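To make the two mechanisms in the abstract concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation: the module names (`encoder`, `masker`), the InfoNCE form of the contrastive loss, and the exponential decay schedule for the layer-wise sharing constraint are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.07):
    """InfoNCE-style loss that shrinks as matched pairs (z_a[i], z_b[i]) get closer."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                    # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

def soft_sharing_penalty(vision_layers, text_layers, base_weight=1.0, decay=0.5):
    """Gradually soft parameter sharing (sketch): penalize the distance between
    corresponding vision/text layer weights, with a constraint that loosens as
    depth grows. Assumes the paired layers have identical parameter shapes."""
    penalty = 0.0
    for depth, (v, t) in enumerate(zip(vision_layers, text_layers)):
        weight = base_weight * (decay ** depth)             # looser constraint at deeper layers
        for pv, pt in zip(v.parameters(), t.parameters()):
            penalty = penalty + weight * (pv - pt).pow(2).sum()
    return penalty

def train_step(encoder, masker, batch, opt_enc, opt_mask):
    """One round of the min-max game described in the abstract
    (hypothetical modules: `masker` produces adversarially masked inputs)."""
    # Adversarial masking step: the masker MAXIMIZES the contrastive loss.
    masked = masker(batch)
    z_orig = encoder(batch).detach()                        # encoder frozen for this step
    loss_mask = -contrastive_loss(z_orig, encoder(masked))
    opt_mask.zero_grad()
    loss_mask.backward()
    opt_mask.step()

    # Encoder step: the encoder MINIMIZES the same distance on fresh masked views.
    with torch.no_grad():
        masked = masker(batch)                              # masker frozen for this step
    loss_enc = contrastive_loss(encoder(batch), encoder(masked))
    opt_enc.zero_grad()                                     # clears stray grads from the masker step
    loss_enc.backward()
    opt_enc.step()
    return loss_enc.item(), loss_mask.item()
```

In this reading, `soft_sharing_penalty` would be added to the encoder loss during pre-training, so early layers of the two streams are nearly tied while deeper layers are free to specialize per modality.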

Authors

  • Chenlu Zhan
    College of Computer Science and Technology, Zhejiang University, Hangzhou, 310058, China. Electronic address: chenlu.22@intl.zju.edu.
  • Peng Peng
    School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, 214122, Jiangsu, China.
  • Hongwei Wang
    Department of Oncological Surgery, Harbin Medical University Cancer Hospital, Harbin, 150000, Heilongjiang Province, China.
  • Gaoang Wang
  • Yu Lin
    Research School of Computer Science, Australian National University, Canberra, 2601, ACT, Australia.
  • Tao Chen
    School of Automation, Northwestern Polytechnical University, Xi'an, 710072, Shaanxi, China.
  • Hongsen Wang
    Department of Cardiology, The Sixth Medical Center, Chinese PLA General Hospital, 28 Fuxing Road, Haidian District, Beijing, 100853, China.