Mamba-Adaptor: State Space Model Adaptor for Visual Recognition
Journal:
arXiv
Published Date:
May 19, 2025
Abstract
Recent State Space Models (SSMs), especially Mamba, have demonstrated
impressive performance in visual modeling and possess superior model
efficiency. However, the application of Mamba to visual tasks suffers from
inferior performance due to three main constraints of the sequential model: 1)
causal computing cannot access global context; 2) long-range information is
forgotten when computing the current hidden states; 3) spatial structural
modeling is weak because the image is flattened into a sequential input. To address these issues, we
investigate a simple yet powerful vision-task Adaptor for Mamba models, which
consists of two functional modules: Adaptor-T and Adaptor-S. When computing the
hidden states of the SSM, we apply a lightweight prediction module, Adaptor-T, to
select a set of learnable locations as memory augmentations that ease long-range
forgetting. Moreover, we leverage Adaptor-S, composed of multi-scale
dilated convolutional kernels, to enhance the spatial modeling and introduce
the image inductive bias into the feature output. Both modules can enlarge the
context modeling in causal computing, as the output is enhanced by
otherwise inaccessible features. We explore three usages of Mamba-Adaptor: a general
visual backbone for various vision tasks; a booster module to raise the
performance of pretrained backbones; and a highly efficient fine-tuning module that
adapts the base model for transfer learning tasks. Extensive experiments verify
the effectiveness of Mamba-Adaptor in all three settings. Notably, our
Mamba-Adaptor achieves state-of-the-art performance on the ImageNet and COCO
benchmarks.
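
The constraints named in the abstract, and the idea behind multi-scale dilated kernels, can be illustrated with a toy sketch. This is not the paper's implementation. The scalar recurrence, the function names (causal_scan, dilated_conv2d, multi_scale_dilated), the fixed 3x3 kernel, the dilation rates, and the residual sum over branches are all assumptions chosen for illustration. The scan shows that each output depends only on earlier inputs and that an early input's contribution decays geometrically. The dilated convolution shows how larger dilation rates reach spatial neighbors that a flattened 1D scan places far apart.

```python
# Toy illustration only; names and design choices here are hypothetical.

def causal_scan(xs, a=0.5, b=1.0, c=1.0):
    """Scalar linear recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # the state only ever sees past and current inputs
        ys.append(c * h)
    return ys

def dilated_conv2d(img, kernel, dilation):
    """Zero-padded 'same' 3x3 cross-correlation with the given dilation."""
    H, W = len(img), len(img[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for ki in range(3):
                for kj in range(3):
                    ii = i + (ki - 1) * dilation
                    jj = j + (kj - 1) * dilation
                    if 0 <= ii < H and 0 <= jj < W:
                        s += kernel[ki][kj] * img[ii][jj]
            out[i][j] = s
    return out

def multi_scale_dilated(img, kernel, dilations=(1, 2)):
    """Assumed fusion: input plus the sum of one branch per dilation rate."""
    H, W = len(img), len(img[0])
    out = [[float(img[i][j]) for j in range(W)] for i in range(H)]
    for d in dilations:
        branch = dilated_conv2d(img, kernel, d)
        for i in range(H):
            for j in range(W):
                out[i][j] += branch[i][j]
    return out

# Impulse at t=0: its contribution to y_t shrinks as a**t (long-range forgetting).
impulse_response = causal_scan([1.0, 0.0, 0.0, 0.0])

# Changing a later input never alters earlier outputs (causality).
ys_a = causal_scan([1.0, 2.0, 3.0, 4.0])
ys_b = causal_scan([1.0, 2.0, 3.0, 99.0])

# A single bright pixel at the center of a 5x5 image.
img = [[0.0] * 5 for _ in range(5)]
img[2][2] = 1.0
ones = [[1.0] * 3 for _ in range(3)]
near = dilated_conv2d(img, ones, dilation=1)   # spreads to the 3x3 neighborhood
far = dilated_conv2d(img, ones, dilation=2)    # spreads to positions two pixels away
fused = multi_scale_dilated(img, ones, dilations=(1, 2))
```

In this sketch the dilation-2 branch carries the center pixel's signal to the image corners in one step, whereas in a row-major flattened sequence those positions are 12 tokens away from the center.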