Hi-End-MAE: Hierarchical encoder-driven masked autoencoders are stronger vision learners for medical image segmentation
Journal: arXiv
Published Date: Feb 12, 2025
Abstract
Medical image segmentation remains a formidable challenge due to label
scarcity. Pre-training a Vision Transformer (ViT) through masked image modeling
(MIM) on large-scale unlabeled medical datasets presents a promising solution,
providing both computational efficiency and model generalization for various
downstream tasks. However, current ViT-based MIM pre-training frameworks
predominantly emphasize local aggregation representations in output layers and
fail to exploit the rich representations across different ViT layers that
better capture fine-grained semantic information needed for more precise
medical downstream tasks. To fill this gap, we present Hierarchical
Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training
solution, which centers on two key innovations: (1) Encoder-driven
reconstruction, which encourages the encoder to learn more informative features
to guide the reconstruction of masked patches; and (2) Hierarchical dense
decoding, which implements a hierarchical decoding structure to capture rich
representations across different layers. We pre-train Hi-End-MAE on a
large-scale dataset of 10K CT scans and evaluate its performance across seven
public medical image segmentation benchmarks. Extensive experiments demonstrate
that Hi-End-MAE achieves superior transfer learning capabilities across various
downstream tasks, revealing the potential of ViT in medical imaging
applications. The code is available at:
https://github.com/FengheTan9/Hi-End-MAE
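
For readers unfamiliar with masked image modeling, the random patch-masking step that MAE-style pre-training generally relies on can be sketched as follows. This is an illustrative sketch only, not the Hi-End-MAE implementation: the function name `random_patch_mask`, the 96-voxel crop, the 16-voxel patch size, and the 0.75 mask ratio are all assumptions chosen for the example and are not taken from the abstract.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) lists.

    Hypothetical helper for illustration: in MAE-style pre-training the
    encoder sees only the visible patches, and the model is trained to
    reconstruct the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Assumed setup: a 96x96x96 CT crop cut into 16x16x16 patches.
patches_per_axis = 96 // 16          # 6 patches along each axis
num_patches = patches_per_axis ** 3  # 216 patches in total
visible, masked = random_patch_mask(num_patches, mask_ratio=0.75, seed=0)
```

With these assumed values, the encoder would process 54 visible patches while the remaining 162 are left for reconstruction. How Hi-End-MAE uses encoder features from different layers to guide that reconstruction is described in the paper and the linked repository, not in this sketch.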