Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi
Journal: arXiv
Published Date: Jan 22, 2025
Abstract
Convolutional neural networks (CNNs) evaluate short-range correlations in
input images which progress along the layers, whereas vision transformer (ViT)
architectures evaluate long-range correlations, using repeated transformer
encoders composed of fully connected layers. Both are designed to solve complex
classification tasks but from different perspectives. This study demonstrates
that CNNs and ViT architectures stem from a unified underlying learning
mechanism, which quantitatively measures the single-nodal performance (SNP) of
each node in feedforward (FF) and multi-head attention (MHA) sub-blocks. Each
node identifies small clusters of possible output labels, with additional noise
represented as labels outside these clusters. These features are progressively
sharpened along the transformer encoders, enhancing the signal-to-noise ratio.
This unified underlying learning mechanism leads to two main findings. First,
it enables an efficient applied nodal diagonal connection (ANDC) pruning
technique without affecting the accuracy. Second, based on the SNP, spontaneous
symmetry breaking occurs among the MHA heads, such that each head focuses its
attention on a subset of labels through cooperation among its SNPs.
Consequently, each head becomes an expert in recognizing its designated labels,
representing a quantitative MHA modus vivendi mechanism. This
statistical-mechanics-inspired viewpoint makes it possible to reveal the
macroscopic behavior of the entire network from the microscopic performance of
each node. These results are
based on a compact convolutional transformer architecture trained on the
CIFAR-100 and Flowers-102 datasets and call for their extension to other
architectures and applications, such as natural language processing.
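
To make the single-nodal performance idea concrete, here is a minimal Python sketch of how one node's label cluster might be estimated. The abstract does not give the measurement procedure, so everything here is an assumption for illustration: the sketch supposes that a node is probed by silencing all other nodes and reading its averaged contribution to a linear classifier head, and the names `single_node_field`, `label_cluster`, `head_weights`, and `threshold` are hypothetical rather than taken from the paper.

```python
import numpy as np

def single_node_field(activations, labels, head_weights, node, num_labels):
    """Mean output contribution of one node, per true label (rows) and output label (columns).

    activations:  (num_samples, num_nodes) node outputs for a set of inputs
    labels:       (num_samples,) integer true labels
    head_weights: (num_nodes, num_labels) weights of an assumed linear classifier head
    """
    mean_act = np.zeros(num_labels)
    for lbl in range(num_labels):
        mask = labels == lbl
        if mask.any():
            # Average activation of this node over all inputs with true label `lbl`.
            mean_act[lbl] = activations[mask, node].mean()
    # With every other node silenced, the output field is the node's mean
    # activation times its outgoing weights (an assumption of this sketch).
    return np.outer(mean_act, head_weights[node])

def label_cluster(field, threshold):
    """Labels whose diagonal entry reaches `threshold` times the largest |entry|."""
    peak = np.abs(field).max()
    if peak == 0:
        return np.array([], dtype=int)
    return np.flatnonzero(np.diag(field) / peak >= threshold)
```

Calling `label_cluster(single_node_field(...), threshold)` for every node in a head and comparing the resulting clusters across heads would be one rough way to look for the per-head label specialization described above. The threshold is a free parameter of this sketch, not a value from the paper.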