SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking
Journal: arXiv
Published Date: Mar 24, 2025
Abstract
Most state-of-the-art trackers adopt a one-stream paradigm, using a single
Vision Transformer for joint feature extraction and relation modeling of
template and search region images. However, relation modeling between different
image patches exhibits significant variations. For instance, background regions
dominated by target-irrelevant information require reduced attention
allocation, while foreground regions, particularly boundary areas, need to be
emphasized. A single model may not effectively handle all kinds of relation
modeling simultaneously. In this paper, we propose a novel tracker called
SPMTrack, based on a mixture of experts tailored for the visual tracking task (TMoE),
combining the capability of multiple experts to handle diverse relation
modeling more flexibly. Benefiting from TMoE, we extend relation modeling from
image pairs to spatio-temporal context, further improving tracking accuracy
with minimal increase in model parameters. Moreover, we employ TMoE as a
parameter-efficient fine-tuning method, substantially reducing trainable
parameters, which enables us to train SPMTrack models of varying scales efficiently
and preserve the generalization ability of pretrained models to achieve
superior performance. We conduct experiments on seven datasets, and
experimental results demonstrate that our method significantly outperforms
current state-of-the-art trackers. The source code is available at
https://github.com/WenRuiCai/SPMTrack.
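
The abstract does not spell out the internals of TMoE, so the following is only a generic sketch of the idea it describes: a frozen pretrained linear layer augmented with a small, softly routed mixture of trainable experts. The low-rank expert form, the softmax router, the zero initialization, and every name here (`MoEAdapterLinear`, `param_counts`, `num_experts`, `rank`) are assumptions made for illustration and are not taken from the SPMTrack code.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng, scale=0.02):
    """Small Gaussian-initialized matrix as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def param_counts(d_in, d_out, num_experts, rank):
    """Return (frozen, trainable) parameter counts for the sketch below."""
    frozen = d_in * d_out                      # pretrained weight, not updated
    router = num_experts * d_in                # one gating logit per expert
    experts = num_experts * rank * (d_in + d_out)  # down + up projections
    return frozen, router + experts


class MoEAdapterLinear:
    """Hypothetical layer: frozen linear map plus gated low-rank experts."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)             # frozen
        self.router = rand_matrix(num_experts, d_in, rng)  # trainable
        self.down = [rand_matrix(rank, d_in, rng) for _ in range(num_experts)]
        # Assumed zero init for the up projections, so the layer starts out
        # identical to the pretrained one.
        self.up = [[[0.0] * rank for _ in range(d_out)]
                   for _ in range(num_experts)]

    def forward(self, x):
        y = matvec(self.W, x)                    # pretrained path
        gates = softmax(matvec(self.router, x))  # per-token expert weights
        for g, A, B in zip(gates, self.down, self.up):
            delta = matvec(B, matvec(A, x))      # low-rank expert update
            y = [yi + g * di for yi, di in zip(y, delta)]
        return y


layer = MoEAdapterLinear(d_in=6, d_out=5, num_experts=3, rank=2, seed=1)
print(layer.forward([0.5, -1.0, 2.0, 0.0, 1.5, -0.25]))
print(param_counts(768, 768, num_experts=4, rank=8))
```

Under these assumed settings, a 768-by-768 layer keeps 589,824 weights frozen and trains 52,224, which illustrates how adding experts in this form increases capacity while leaving most parameters untouched. The actual expert structure and routing used by SPMTrack may differ; the linked repository is the authoritative source.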