SW-ViT: A Spatio-Temporal Vision Transformer Network with Post Denoiser for Sequential Multi-Push Ultrasound Shear Wave Elastography
Journal: arXiv
Published Date: May 24, 2025
Abstract
Objective: Ultrasound Shear Wave Elastography (SWE) demonstrates great
potential in assessing soft-tissue pathology by mapping tissue stiffness, which
is linked to malignancy. Traditional SWE methods have shown promise in
estimating tissue elasticity, but they are susceptible to noise, rely on
limited training data, and cannot generate a segmentation mask alongside the
stiffness map, all of which limit accuracy and reliability. Approach:
In this paper, we propose SW-ViT, a novel two-stage deep learning framework for
SWE that integrates a CNN-Spatio-Temporal Vision Transformer-based
reconstruction network with an efficient Transformer-based post-denoising
network. The first stage uses a 3D ResNet encoder with multi-resolution
spatio-temporal Transformer blocks that capture spatial and temporal features,
followed by a squeeze-and-excitation attention decoder that reconstructs 2D
stiffness maps. To address data limitations, a patch-based training strategy is
adopted for localized learning and reconstruction. In the second stage, a
denoising network with a shared encoder and dual decoders processes inclusion
and background regions to produce a refined stiffness map and segmentation
mask. A hybrid loss combining regional, smoothness, fusion, and Intersection
over Union (IoU) components ensures improvements in both reconstruction and
segmentation. Results: On simulated data, our method achieves a peak
signal-to-noise ratio (PSNR) of 32.68 dB, a contrast-to-noise ratio (CNR) of
46.78 dB, and a structural similarity index (SSIM) of 0.995. On phantom data,
results include PSNR of
21.11 dB, CNR of 42.14 dB, and SSIM of 0.936. Segmentation IoU values reach
0.949 (simulation) and 0.738 (phantom), with average symmetric surface
distance (ASSD) values of 0.184 and 1.011,
respectively. Significance: SW-ViT delivers robust, high-quality elasticity map
estimates from noisy SWE data and holds clear promise for clinical application.
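
As an illustration of two of the metrics reported above, the following is a minimal sketch of how PSNR (for a reconstructed stiffness map) and IoU (for a segmentation mask) are commonly computed. It is not the authors' evaluation code. The function names `psnr` and `iou`, the nested-list representation of the 2D maps, and the choice of the reference map's maximum as the peak value are all assumptions made for this example; the paper's exact definitions may differ.

```python
import math


def psnr(reference, estimate, peak=None):
    """PSNR in dB between two equally sized 2D maps (lists of rows).

    Assumption: when `peak` is not given, the maximum of the reference
    map is used as the peak value. The paper may define it differently.
    """
    ref = [v for row in reference for v in row]
    est = [v for row in estimate for v in row]
    if len(ref) != len(est):
        raise ValueError("maps must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf  # identical maps
    if peak is None:
        peak = max(ref)
    return 10.0 * math.log10(peak ** 2 / mse)


def iou(mask_a, mask_b):
    """Intersection over Union of two binary 2D masks (lists of rows)."""
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    if len(a) != len(b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    # Convention chosen here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    # Toy 2x2 stiffness maps and masks, purely illustrative values.
    truth = [[10.0, 10.0], [10.0, 10.0]]
    recon = [[9.0, 9.0], [9.0, 9.0]]
    print(psnr(truth, recon))
    print(iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

In practice these would be evaluated on full-resolution arrays (for example with NumPy), but the arithmetic is the same: PSNR grows as the mean squared error between the reconstructed and ground-truth maps shrinks, and IoU approaches 1 as the predicted inclusion mask overlaps the true one more closely.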