A vision attention driven Language framework for medical report generation.

Journal: Scientific Reports
PMID:

Abstract

This study introduces the Medical Vision Attention Generation (MedVAG) model, a novel framework for the automated generation of medical reports. MedVAG integrates Vision Transformer (ViT)-based visual feature extraction and GPT-2 language modeling, enhanced by graph-based feature fusion and multiple attention mechanisms (co-attention, cross-attention, memory-guided attention), to ensure semantic coherence and diagnostic accuracy. Evaluated on the IU X-Ray and COV-CTR datasets, the model achieved state-of-the-art performance across natural language generation metrics (BLEU, METEOR, ROUGE, CIDEr) and clinical effectiveness measures. Ablation studies highlighted the critical role of attention mechanisms and feature fusion in aligning visual and textual features. MedVAG demonstrates strong potential as an assistive technology that supports radiologists by reducing workload and enhancing diagnostic accuracy.
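
As an illustration of the cross-attention idea named in the abstract, the following is a minimal sketch of generic scaled dot-product cross-attention, in which text-side query vectors attend over image-patch vectors. It is not the MedVAG implementation: the function names, the toy vectors, and the reuse of patch vectors as both keys and values are assumptions made for this example, and the learned projections, graph-based fusion, and memory-guided attention are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_queries, patch_keys, patch_values):
    """Generic scaled dot-product cross-attention (illustrative only).

    Each text-side query attends over all image-patch keys and receives
    a weighted average of the corresponding patch values.
    """
    dim = len(patch_keys[0])
    attended, all_weights = [], []
    for query in text_queries:
        # Similarity between this query and every patch key, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in patch_keys
        ]
        weights = softmax(scores)
        # Weighted sum of patch values, one output component at a time.
        context = [
            sum(w * value[j] for w, value in zip(weights, patch_values))
            for j in range(len(patch_values[0]))
        ]
        attended.append(context)
        all_weights.append(weights)
    return attended, all_weights


# Hypothetical toy inputs: 3 image patches and 2 text tokens, dimension 2.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = [[2.0, 0.0], [0.0, 2.0]]
context, weights = cross_attention(tokens, patches, patches)
```

In this toy setup each row of `weights` sums to 1, and a token places more weight on the patches whose vectors align with it. In a real system the queries would come from the language model's hidden states and the keys and values from learned projections of the visual features.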

Authors

  • Merve Varol Arısoy
    Bucak Faculty of Computer and Informatics, Information Systems Engineering Department, Burdur Mehmet Akif Ersoy University, Burdur, Turkey. mvarisoy@mehmetakif.edu.tr.
  • Ayhan Arısoy
    Bucak Faculty of Computer and Informatics, Information Systems Engineering Department, Burdur Mehmet Akif Ersoy University, Burdur, Turkey.
  • İlhan Uysal
    Information Systems and Technologies Department, Burdur Mehmet Akif Ersoy University, Bucak Zeliha Tolunay School of Applied Technology and Business, Burdur, Turkey.