UnICLAM: Contrastive representation learning with adversarial masking for unified and interpretable Medical Vision Question Answering.

Journal: Medical Image Analysis
Published Date:

Abstract

Medical Visual Question Answering (Medical-VQA) aims to assist doctors in decision-making by answering clinical questions about radiology images. Nevertheless, current models learn cross-modal representations with the vision and text encoders residing in two separate spaces, which inevitably leads to indirect semantic alignment. In this paper, we propose UnICLAM, a Unified and Interpretable Medical-VQA model based on Contrastive Representation Learning with Adversarial Masking. To learn an aligned image-text representation, we first establish a unified dual-stream pre-training structure with a gradually soft parameter-sharing strategy for alignment. Specifically, the strategy constrains the vision and text encoders to stay close in a shared space, and the constraint is gradually loosened as the layer depth increases, narrowing the distance between the two modalities. To grasp a unified semantic cross-modal representation, we extend adversarial masking data augmentation to the contrastive representation learning of vision and text in a unified manner. While encoder training minimizes the distance between the original and masked samples, the adversarial masking module is trained adversarially to maximize that distance. We also further explore the unified adversarial masking augmentation, which improves potential ante-hoc interpretability while retaining remarkable performance and efficiency. Experimental results on the VQA-RAD and SLAKE benchmarks demonstrate that UnICLAM outperforms 11 existing state-of-the-art Medical-VQA methods. More importantly, we additionally discuss the performance of UnICLAM in diagnosing heart failure, verifying that it exhibits superior few-shot adaptation performance in practical disease diagnosis. The code and models will be released upon acceptance of the paper.
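To make the two mechanisms in the abstract concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation: the module names (`encoder`, `masker`), the InfoNCE form of the contrastive loss, and the exponential decay schedule for the layer-wise sharing constraint are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.07):
    """InfoNCE-style loss that shrinks as matched pairs (z_a[i], z_b[i]) get closer."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                    # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

def soft_sharing_penalty(vision_layers, text_layers, base_weight=1.0, decay=0.5):
    """Gradually soft parameter sharing (sketch): penalize the distance between
    corresponding vision/text layer weights, with a constraint that loosens as
    depth grows. Assumes the paired layers have identical parameter shapes."""
    penalty = 0.0
    for depth, (v, t) in enumerate(zip(vision_layers, text_layers)):
        weight = base_weight * (decay ** depth)             # looser constraint at deeper layers
        for pv, pt in zip(v.parameters(), t.parameters()):
            penalty = penalty + weight * (pv - pt).pow(2).sum()
    return penalty

def train_step(encoder, masker, batch, opt_enc, opt_mask):
    """One round of the min-max game described in the abstract
    (hypothetical modules: `masker` produces adversarially masked inputs)."""
    # Adversarial masking step: the masker MAXIMIZES the contrastive loss.
    masked = masker(batch)
    z_orig = encoder(batch).detach()                        # encoder frozen for this step
    loss_mask = -contrastive_loss(z_orig, encoder(masked))
    opt_mask.zero_grad()
    loss_mask.backward()
    opt_mask.step()

    # Encoder step: the encoder MINIMIZES the same distance on fresh masked views.
    with torch.no_grad():
        masked = masker(batch)                              # masker frozen for this step
    loss_enc = contrastive_loss(encoder(batch), encoder(masked))
    opt_enc.zero_grad()                                     # clears stray grads from the masker step
    loss_enc.backward()
    opt_enc.step()
    return loss_enc.item(), loss_mask.item()
```

In this reading, `soft_sharing_penalty` would be added to the encoder loss during pre-training, so early layers of the two streams are nearly tied while deeper layers are free to specialize per modality.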

Authors

  • Chenlu Zhan
    College of Computer Science and Technology, Zhejiang University, Hangzhou, 310058, China. Electronic address: chenlu.22@intl.zju.edu.
  • Peng Peng
    School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, 214122, Jiangsu, China.
  • Hongwei Wang
    Department of Oncological Surgery, Harbin Medical University Cancer Hospital, Harbin, 150000, Heilongjiang Province, China.
  • Gaoang Wang
  • Yu Lin
    Research School of Computer Science, Australian National University, Canberra, 2601, ACT, Australia.
  • Tao Chen
    School of Automation, Northwestern Polytechnical University, Xi'an, 710072, Shaanxi, China.
  • Hongsen Wang
    Department of Cardiology, The Sixth Medical Center, Chinese PLA General Hospital, 28 Fuxing Road, Haidian District, Beijing, 100853, China.