Anatomical Attention Alignment representation for Radiology Report Generation
Journal: arXiv
Published Date: May 12, 2025
Abstract
Automated radiology report generation (RRG) aims to produce detailed
descriptions of medical images, reducing radiologists' workload and improving
access to high-quality diagnostic services. Existing encoder-decoder models
rely only on visual features extracted from raw input images, which can limit
their understanding of spatial structures and semantic relationships and often
results in suboptimal text generation. To address this, we propose the Anatomical
Attention Alignment Network (A3Net), a framework that enhances visual-textual
understanding by constructing hyper-visual representations. Our approach
integrates a knowledge dictionary of anatomical structures with patch-level
visual features, enabling the model to effectively associate image regions with
their corresponding anatomical entities. This structured representation
improves semantic reasoning, interpretability, and cross-modal alignment,
ultimately enhancing the accuracy and clinical relevance of generated reports.
Experimental results on the IU X-Ray and MIMIC-CXR datasets demonstrate that A3Net
significantly improves both visual perception and text generation quality. Our
code is available at https://github.com/Vinh-AI/A3Net.
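
The abstract does not spell out how the knowledge dictionary is combined with patch-level features, so the following is only an illustrative sketch and not the A3Net implementation. It assumes scaled dot-product cross-attention, with patches as queries and dictionary entries as keys and values, followed by concatenation to form a "hyper-visual" representation. The function names, the feature sizes (49 patches, 12 anatomical entries, 64 dimensions), and the random inputs are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def align_patches_to_anatomy(patch_feats, anatomy_dict):
    """Attend from image patches to anatomical dictionary entries.

    patch_feats:  (num_patches, dim) patch-level visual features
    anatomy_dict: (num_entities, dim) one embedding per anatomical structure
    Returns the enriched features and the patch-to-entity attention weights.
    """
    dim = patch_feats.shape[1]
    # Similarity of every patch to every anatomical entry (assumed mechanism).
    scores = patch_feats @ anatomy_dict.T / np.sqrt(dim)
    # Each row is a distribution over anatomical entries for one patch.
    weights = softmax(scores, axis=1)
    # Anatomy-aware context vector per patch.
    anatomy_context = weights @ anatomy_dict
    # Assumed fusion: concatenate raw visual and anatomical context features.
    hyper_visual = np.concatenate([patch_feats, anatomy_context], axis=1)
    return hyper_visual, weights


# Hypothetical inputs: random stand-ins for real encoder outputs.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(49, 64))
anatomy_dict = rng.normal(size=(12, 64))

hyper_visual, weights = align_patches_to_anatomy(patch_feats, anatomy_dict)
print(hyper_visual.shape, weights.shape)
```

In a sketch like this, the rows of `weights` can be inspected to see which anatomical entry each patch attends to most strongly, which is one way the interpretability described above could be examined.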