Rectifying Magnitude Neglect in Linear Attention
Journal:
arXiv
Published Date:
Jul 1, 2025
Abstract
As the core operator of Transformers, Softmax Attention exhibits excellent
global modeling capabilities. However, its quadratic complexity limits its
applicability to vision tasks. In contrast, Linear Attention shares a similar
formulation with Softmax Attention while achieving linear complexity, enabling
efficient global information modeling. Nevertheless, Linear Attention suffers
from a significant performance degradation compared to standard Softmax
Attention. In this paper, we analyze the underlying causes of this issue based
on the formulation of Linear Attention. We find that, unlike Softmax Attention,
Linear Attention entirely disregards the magnitude information of the Query.
This prevents the attention score distribution from dynamically adapting as the
Query scales. As a result, despite its structural similarity to Softmax
Attention, Linear Attention exhibits a significantly different attention score
distribution. Based on this observation, we propose Magnitude-Aware Linear
Attention (MALA), which modifies the computation of Linear Attention to fully
incorporate the Query's magnitude. This adjustment allows MALA to generate an
attention score distribution that closely resembles Softmax Attention while
exhibiting a more well-balanced structure. We evaluate the effectiveness of
MALA on multiple tasks, including image classification, object detection,
instance segmentation, semantic segmentation, natural language processing,
speech recognition, and image generation. Our MALA achieves strong results on
all of these tasks. Code will be available at https://github.com/qhfan/MALA