GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning
Journal:
arXiv
Published Date:
Jul 9, 2025
Abstract
Microscopic assessment of histopathology images is vital for accurate cancer
diagnosis and treatment. Whole Slide Image (WSI) classification and captioning
have become crucial tasks in computer-aided pathology. However, microscopic WSI
face challenges such as redundant patches and unknown patch positions due to
subjective pathologist captures. Moreover, generating automatic pathology
captions remains a significant challenge. To address these issues, we introduce
a novel GNN-ViTCap framework for classification and caption generation from
histopathological microscopic images. First, a visual feature extractor
generates patch embeddings. Redundant patches are then removed by dynamically
clustering these embeddings using deep embedded clustering and selecting
representative patches via a scalar dot attention mechanism. We build a graph
by connecting each node to its nearest neighbors in the similarity matrix and
apply a graph neural network to capture both local and global context. The
aggregated image embeddings are projected into the language model's input space
through a linear layer and combined with caption tokens to fine-tune a large
language model. We validate our method on the BreakHis and PatchGastric
datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for
classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569
for captioning. Experimental results demonstrate that GNN-ViTCap outperforms
state of the art approaches, offering a reliable and efficient solution for
microscopy based patient diagnosis.