A vision attention driven Language framework for medical report generation.
Journal: Scientific Reports
PMID: 40155699
Abstract
This study introduces the Medical Vision Attention Generation (MedVAG) model, a framework for automated medical report generation. MedVAG combines Vision Transformer (ViT)-based visual feature extraction with GPT-2 language modeling, and adds graph-based feature fusion and three attention mechanisms (co-attention, cross-attention, and memory-guided attention) to improve the semantic coherence and diagnostic accuracy of the generated reports. Evaluated on the IU X-Ray and COV-CTR datasets, the model achieved state-of-the-art performance on natural language generation metrics (BLEU, METEOR, ROUGE, CIDEr) and on clinical effectiveness measures. Ablation studies showed that the attention mechanisms and feature fusion are critical for aligning visual and textual features. MedVAG shows strong potential as an assistive technology intended to support radiologists by reducing workload and improving diagnostic accuracy.
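The abstract does not give implementation details, so the following is only a minimal sketch of the general idea behind cross-attention between text and image features, not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, and the function names (`softmax`, `cross_attention`) and toy vectors are hypothetical. Each text-token vector acts as a query over a small set of visual patch vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_queries, visual_keys, visual_values):
    """Scaled dot-product attention: each text query attends over visual patches.

    Simplification (assumption): no learned query/key/value projections and a
    single head; a real model would apply trained linear layers first.
    """
    d = len(visual_keys[0])
    outputs, weights = [], []
    for q in text_queries:
        # Similarity of this text token to every visual patch.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in visual_keys
        ]
        w = softmax(scores)
        # Weighted average of the patch value vectors.
        out = [
            sum(wj * v[c] for wj, v in zip(w, visual_values))
            for c in range(len(visual_values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data (hypothetical): three 2-d "patch" features, two 2-d "token" features.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = [[2.0, 0.0], [0.0, 2.0]]
attended, attn = cross_attention(tokens, patches, patches)
```

In this toy setup the first token aligns with the first and third patches, so those two receive equal weight and the second patch receives less. In a full system, the queries would come from the language model's hidden states and the keys and values from the visual encoder's patch embeddings.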