Scaling Semantic Categories: Investigating the Impact on Vision Transformer Labeling Performance
Journal: arXiv
Published Date: Mar 16, 2025
Abstract
This study explores the impact of scaling semantic categories on the image
classification performance of vision transformers (ViTs). The experiments use
the CLIP server provided by Jina AI. The
research hypothesizes that as the number of ground truth and artificially
introduced semantically equivalent categories increases, the labeling accuracy
of ViTs improves until a theoretical maximum or limit is reached. A wide
variety of image datasets were chosen to test this hypothesis. These datasets
were processed through a custom function in Python designed to evaluate the
model's accuracy, with adjustments being made to account for format differences
between datasets. The number of redundant categories was increased
exponentially, and accuracy was tracked until it plateaued, decreased, or
fluctuated inconsistently. The findings show that while semantic scaling
initially increases model performance, the benefits diminish or reverse after
surpassing a critical threshold, providing insight into the limitations and
possible optimization of category labeling strategies for ViTs.
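
The evaluation procedure described above might be sketched roughly as follows. This is a minimal illustration, not the study's actual function: the names `expand_labels`, `accuracy_at_scale`, and `sweep` are hypothetical, the `classify` callable stands in for a request to the CLIP server (whose client calls are not shown here), and doubling the number of labels per class is only one possible reading of "exponentially".

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: given an image and candidate labels, return the chosen label.
# In the study this role is played by the CLIP server; here it is any callable.
Classifier = Callable[[object, Sequence[str]], str]


def expand_labels(synonyms: Dict[str, List[str]], per_class: int) -> Dict[str, str]:
    """Map each candidate label to its ground-truth class.

    Each class contributes its own name plus up to per_class - 1
    semantically equivalent labels (assumed to be supplied by the caller).
    """
    label_to_class: Dict[str, str] = {}
    for cls, alternatives in synonyms.items():
        for label in [cls] + alternatives[: per_class - 1]:
            label_to_class[label] = cls
    return label_to_class


def accuracy_at_scale(
    classify: Classifier,
    samples: Sequence[Tuple[object, str]],
    synonyms: Dict[str, List[str]],
    per_class: int,
) -> float:
    """Fraction of samples whose predicted label maps back to the true class."""
    label_to_class = expand_labels(synonyms, per_class)
    candidates = list(label_to_class)
    correct = sum(
        label_to_class[classify(image, candidates)] == truth
        for image, truth in samples
    )
    return correct / len(samples)


def sweep(
    classify: Classifier,
    samples: Sequence[Tuple[object, str]],
    synonyms: Dict[str, List[str]],
    max_per_class: int,
) -> List[Tuple[int, float]]:
    """Record accuracy while doubling the number of labels per class (assumed schedule)."""
    results = []
    per_class = 1
    while per_class <= max_per_class:
        results.append(
            (per_class, accuracy_at_scale(classify, samples, synonyms, per_class))
        )
        per_class *= 2
    return results
```

Plotting the pairs returned by `sweep` against the number of labels per class would show whether accuracy plateaus, declines, or fluctuates as redundant categories are added.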