Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens
Journal: arXiv
Published Date: Dec 6, 2024
Abstract
Transformers, a groundbreaking architecture proposed for Natural Language
Processing (NLP), have also achieved remarkable success in Computer Vision. A
cornerstone of their success lies in the attention mechanism, which models
relationships among tokens. Whereas tokenization in NLP inherently ensures
that a single token does not carry multiple semantics, the Vision Transformer
(ViT) builds its tokens from uniformly partitioned square image patches, which
may arbitrarily mix visual concepts within a single token. In this work, we
propose to replace the grid-based tokenization in ViT with superpixel
tokenization, which uses superpixels to generate tokens that each encapsulate
a single visual concept.
Unfortunately, the diverse shapes, sizes, and locations of superpixels make
integrating them into ViT tokenization challenging. Our tokenization pipeline,
composed of pre-aggregate extraction and superpixel-aware aggregation,
overcomes these challenges. Extensive experiments demonstrate that our
approach, which
exhibits strong compatibility with existing frameworks, enhances the accuracy
and robustness of ViT on various downstream tasks.
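
To illustrate the contrast between grid-based and superpixel tokenization, the following is a minimal sketch and not the authors' pipeline. It assumes a precomputed superpixel label map, uses scalar pixel intensities in place of learned features, and uses mean pooling as a stand-in for the aggregation step. The function names `grid_tokens` and `superpixel_tokens` are hypothetical.

```python
from collections import defaultdict


def grid_tokens(image, patch):
    """Mean intensity of each non-overlapping patch x patch square (row-major)."""
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            vals = [
                image[r][c]
                for r in range(top, min(top + patch, h))
                for c in range(left, min(left + patch, w))
            ]
            tokens.append(sum(vals) / len(vals))
    return tokens


def superpixel_tokens(image, labels):
    """Mean intensity of each superpixel, keyed by its label id.

    Mean pooling is an assumption made for this sketch; the abstract does not
    specify how superpixel-aware aggregation is computed.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for img_row, lab_row in zip(image, labels):
        for value, label in zip(img_row, lab_row):
            sums[label] += value
            counts[label] += 1
    return {label: sums[label] / counts[label] for label in sorted(sums)}


# Toy 4x4 image: a dark region (0.0) in column 0 and a bright region (1.0)
# in columns 1-3, so the boundary does not align with the 2x2 patch grid.
image = [[0.0, 1.0, 1.0, 1.0] for _ in range(4)]

# Hypothetical superpixel label map that follows the region boundary.
labels = [[0, 1, 1, 1] for _ in range(4)]

print(grid_tokens(image, 2))             # left-column patches mix both regions
print(superpixel_tokens(image, labels))  # each token covers one region only
```

In this toy case the two left-hand grid patches average dark and bright pixels into a single value, while each superpixel token summarizes exactly one region. The number of tokens and the pixel count per token also vary with the label map, which hints at why irregular superpixels are harder to fit into a standard ViT than fixed-size patches.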