Decoding split-frequency representation for cross-scale tracking.
Journal:
Neural Networks: The Official Journal of the International Neural Network Society
Published Date:
May 22, 2025
Abstract
Learning tailored target representations is a promising direction in visual object tracking. Most state-of-the-art methods use autoencoders to generate representations by reconstructing the target's appearance. However, these reconstructions typically rely on augmentations that mimic scale jitter and alteration, neglecting physical scale observations such as those in aerial videos. This article addresses representation learning for cross-scale tracking in generalized scenarios. Specifically, we incorporate target scale directly into the positional encoding, expressing scale through relative pixel density rather than the conventional metric of image resolution. This scale-aware encoding is then integrated into the proposed asymptotic hierarchy of decoders, which reconstructs representations by emphasizing the restoration of high-frequency features at large scales and low-frequency features at tiny scales. The reconstruction is supervised with split losses, yielding robust cross-scale representations for generic objects. Extensive experiments on six benchmarks (GOT-10k, LaSOT, TrackingNet, DTB70, UAV123, and TNL2K) validate the superior performance of our method. Additionally, our tracker achieves a remarkable speed of 123 frames per second on a GPU, surpassing the previous best autoencoder-based tracker. The code and raw results will be made publicly available at https://github.com/pellab/DSC.
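To make the two core ideas in the abstract concrete, here is a minimal sketch of a scale-aware positional encoding. It is an assumption based only on the abstract's description (scale injected into the encoding, expressed as relative pixel density rather than image resolution); the function and parameter names (`scale_aware_encoding`, `ref_density`, `target_px`, `target_extent`) are illustrative, not the authors' API.

```python
# Hypothetical sketch: a sinusoidal positional encoding whose frequencies are
# modulated by the target's relative pixel density, so the same physical
# extent maps to a similar encoding regardless of image resolution.
import torch

def scale_aware_encoding(h, w, target_px, target_extent, d_model=256, ref_density=1.0):
    """1-D sinusoidal encoding over flattened H*W positions, scale-aware."""
    # Relative pixel density: pixels covering one unit of the target's
    # physical extent, normalized by an assumed reference density.
    density = (target_px / target_extent) / ref_density

    pos = torch.arange(h * w, dtype=torch.float32).unsqueeze(1)   # (HW, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)        # (d_model/2,)
    # Base frequencies scaled by relative density.
    freq = density / (10000.0 ** (dim / d_model))                 # (d_model/2,)

    pe = torch.zeros(h * w, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe                                                     # (HW, d_model)

# Example: a 16x16 feature map where the target spans 48 pixels of a
# 1.5-unit physical extent.
pe = scale_aware_encoding(16, 16, target_px=48.0, target_extent=1.5)
print(pe.shape)  # torch.Size([256, 256])
```

Likewise, a "split loss" that supervises low- and high-frequency reconstruction separately could look like the sketch below, with the weighting tied to target scale (high-frequency emphasis for large targets, low-frequency emphasis for tiny ones, as the abstract suggests). The Gaussian-blur band split and the linear weighting are assumptions, not the paper's exact losses.

```python
import torch.nn.functional as F

def split_frequency_loss(recon, target, scale, kernel_size=5, sigma=1.0):
    """L1 loss split into low- and high-frequency bands via Gaussian blur.
    `scale` in (0, 1]: larger values mean a larger on-screen target."""
    # Separable Gaussian kernel for the low-pass branch.
    half = kernel_size // 2
    x = torch.arange(-half, half + 1, dtype=torch.float32)
    g = torch.exp(-x ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    c = target.shape[1]
    kh = g.view(1, 1, 1, -1).expand(c, 1, 1, kernel_size)  # horizontal pass
    kv = g.view(1, 1, -1, 1).expand(c, 1, kernel_size, 1)  # vertical pass

    def lowpass(img):
        img = F.conv2d(img, kh, padding=(0, half), groups=c)
        return F.conv2d(img, kv, padding=(half, 0), groups=c)

    low_t, low_r = lowpass(target), lowpass(recon)
    loss_low = F.l1_loss(low_r, low_t)
    loss_high = F.l1_loss(recon - low_r, target - low_t)
    # Assumed scale-dependent weighting between the two bands.
    return (1 - scale) * loss_low + scale * loss_high
```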