Masked strategies for images with small objects
Journal: arXiv
Published Date: Apr 24, 2025
Abstract
Detecting and classifying small blood components is a significant challenge in
hematology analytics, particularly when the objects exist as small, pixel-sized
entities within a large context of similar objects. Deep learning
approaches using supervised models with pre-trained weights, such as residual
networks and vision transformers (ViTs), have demonstrated success in many
applications. Unfortunately, when applied to images outside the domain of the
learned representations, these methods often yield unsatisfactory
performance. One way to overcome this is to use self-supervised models, which
learn representations whose weights are then transferred to downstream
applications. Recently, masked autoencoders (MAEs) have proven effective at
obtaining representations that capture global context
information. By masking regions of an image and training the model to
reconstruct both the masked and unmasked regions, one obtains weights that can
be reused for various applications. However, if the objects in an image are
smaller than the masked patches, the global context information is lost,
making the image almost impossible to reconstruct. In this study, we
investigated the
effect of mask ratios and patch sizes on blood components, using an MAE to
obtain learned ViT encoder representations. We then applied the encoder weights
to train a U-Net Transformer for semantic segmentation, capturing both local and
global contextual information. Our experimental results demonstrate that both
smaller mask ratios and smaller patch sizes improve the MAE's reconstruction of
images. We also present semantic segmentation results with and without
pre-trained weights, showing that smaller blood components benefited from
pre-training. Overall, our proposed method offers an efficient and effective
strategy for the segmentation and classification of small objects.
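The masking argument can be illustrated with a small simulation. The sketch below is not the authors' code and makes several assumptions: square images, square objects placed uniformly at random, and MAE-style random masking (shuffle the patch indices, keep the first int(N(1 - r)) patches visible, mask the rest). All function names are hypothetical, and the 224-pixel image and 4-pixel object are illustrative values rather than the study's settings. It measures only how often an object is completely hidden from the encoder, not reconstruction or segmentation quality.

```python
import random
from typing import Set


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image (ViT-style patchify)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def random_mask(n_patches: int, mask_ratio: float, rng: random.Random) -> Set[int]:
    """Indices of masked patches: shuffle all indices, keep the first
    int(n_patches * (1 - mask_ratio)) visible and mask the remainder."""
    order = list(range(n_patches))
    rng.shuffle(order)
    n_keep = int(n_patches * (1 - mask_ratio))
    return set(order[n_keep:])


def patches_covering(x: int, y: int, obj_size: int, patch_size: int, image_size: int) -> Set[int]:
    """Row-major indices of every patch that overlaps a square object whose
    top-left pixel is (x, y) and whose side is obj_size pixels."""
    side = image_size // patch_size
    cols = range(x // patch_size, (x + obj_size - 1) // patch_size + 1)
    rows = range(y // patch_size, (y + obj_size - 1) // patch_size + 1)
    return {r * side + c for r in rows for c in cols}


def fully_hidden_rate(image_size: int, patch_size: int, mask_ratio: float,
                      obj_size: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which a randomly placed object lies entirely
    inside masked patches, so none of its pixels reach the encoder."""
    rng = random.Random(seed)
    n = num_patches(image_size, patch_size)
    hidden = 0
    for _ in range(trials):
        masked = random_mask(n, mask_ratio, rng)
        # Uniform placement that keeps the whole object inside the image.
        x = rng.randrange(0, image_size - obj_size + 1)
        y = rng.randrange(0, image_size - obj_size + 1)
        if patches_covering(x, y, obj_size, patch_size, image_size) <= masked:
            hidden += 1
    return hidden / trials


if __name__ == "__main__":
    # Hypothetical setting: 224x224 image, 4x4-pixel object.
    for patch_size in (8, 16, 32):
        for mask_ratio in (0.25, 0.5, 0.75):
            rate = fully_hidden_rate(224, patch_size, mask_ratio, obj_size=4)
            print(f"patch={patch_size:2d} ratio={mask_ratio:.2f} hidden={rate:.3f}")
```

Under these assumptions the hidden rate rises with the mask ratio and, for an object smaller than a patch, with the patch size: a large patch usually contains the whole object, so the object vanishes whenever that single patch is masked, while small patches split it across several patches that must all be masked to hide it.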