Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

Journal: arXiv

Published Date: Dec 6, 2024

Abstract

Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships among tokens. While the tokenization process in NLP inherently ensures that a single token does not contain multiple semantics, the tokenization of Vision Transformer (ViT) utilizes tokens from uniformly partitioned square image patches, which may result in an arbitrary mixing of visual concepts in a token. In this work, we propose to substitute the grid-based tokenization in ViT with superpixel tokenization, which employs superpixels to generate a token that encapsulates a sole visual concept. Unfortunately, the diverse shapes, sizes, and locations of superpixels make integrating superpixels into ViT tokenization rather challenging. Our tokenization pipeline, comprised of pre-aggregate extraction and superpixel-aware aggregation, overcomes the challenges that arise in superpixel tokenization. Extensive experiments demonstrate that our approach, which exhibits strong compatibility with existing frameworks, enhances the accuracy and robustness of ViT on various downstream tasks.
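The contrast the abstract draws between grid-based and superpixel-based tokens can be illustrated with a toy sketch. This is not the authors' pipeline: the abstract does not specify how pre-aggregate extraction or superpixel-aware aggregation work, so the sketch assumes plain mean pooling of per-pixel features in both cases, and the names `grid_tokens`, `superpixel_tokens`, `features`, and `labels` are hypothetical. The superpixel label map is hand-written here; in practice it would come from a superpixel algorithm.

```python
def grid_tokens(features, patch):
    """Mean-pool per-pixel feature vectors over non-overlapping square patches.

    Assumption: mean pooling stands in for ViT's patch embedding, only to
    show how a square patch can blend different regions into one token.
    """
    height, width = len(features), len(features[0])
    dim = len(features[0][0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            pixels = [
                features[y][x]
                for y in range(top, min(top + patch, height))
                for x in range(left, min(left + patch, width))
            ]
            tokens.append([sum(p[i] for p in pixels) / len(pixels) for i in range(dim)])
    return tokens


def superpixel_tokens(features, labels):
    """Mean-pool per-pixel feature vectors within each superpixel label.

    Assumption: mean pooling is a placeholder for the paper's
    superpixel-aware aggregation, whose details the abstract omits.
    """
    sums, counts = {}, {}
    for feature_row, label_row in zip(features, labels):
        for vec, label in zip(feature_row, label_row):
            if label not in sums:
                sums[label] = [0.0] * len(vec)
                counts[label] = 0
            for i, value in enumerate(vec):
                sums[label][i] += value
            counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sorted(sums)}


# Toy 2x4 "image" with 1-dimensional features: region 0 has value 1.0,
# region 1 has value 3.0, and the boundary does not align with the 2x2 grid.
features = [
    [[1.0], [1.0], [1.0], [3.0]],
    [[1.0], [1.0], [1.0], [3.0]],
]
labels = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

print(grid_tokens(features, 2))             # second patch mixes both regions
print(superpixel_tokens(features, labels))  # one token per region, no mixing
```

In this toy case the second grid patch straddles the region boundary and averages the two regions together, while each superpixel token covers exactly one region. Because superpixels vary in shape and size, the number of pixels pooled per token varies as well, which is one aspect of the integration difficulty the abstract mentions.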

Authors

Jaihyun Lew
Soohyuk Jang
Jaehoon Lee
Seungryong Yoo
Eunji Kim
Saehyung Lee
Jisoo Mok
Siwon Kim
Sungroh Yoon

External Resources

View on arXiv (http://arxiv.org/abs/2412.04680v3)
