Revisiting Transformers with Insights from Image Filtering
Journal:
arXiv
Published Date:
Jun 12, 2025
Abstract
The self-attention mechanism, a cornerstone of Transformer-based
state-of-the-art deep learning architectures, is largely heuristic-driven and
fundamentally challenging to interpret. Establishing a robust theoretical
foundation to explain its remarkable success and limitations has therefore
become an increasingly prominent focus in recent research. Some notable
directions have explored understanding self-attention through the lens of image
denoising and nonparametric regression. While promising, existing frameworks
still lack a deeper mechanistic interpretation of various architectural
components that enhance self-attention, both in its original formulation and
subsequent variants. In this work, we aim to advance this understanding by
developing a unifying image processing framework, capable of explaining not
only the self-attention computation itself but also the role of components such
as positional encoding and residual connections, including numerous later
variants. We also pinpoint potential distinctions between the two concepts
building upon our framework, and make effort to close this gap. We introduce
two independent architectural modifications within transformers. While our
primary objective is interpretability, we empirically observe that image
processing-inspired modifications can also lead to notably improved accuracy
and robustness against data contamination and adversaries across language and
vision tasks as well as better long sequence understanding.