Emergence of human-like attention and distinct head clusters in self-supervised vision transformers: A comparative eye-tracking study.

Journal: Neural Networks: The Official Journal of the International Neural Network Society
Published Date:

Abstract

Visual attention models aim to predict human gaze behavior, yet traditional saliency models and deep gaze prediction networks face limitations. Saliency models rely on handcrafted low-level visual features, often failing to capture human gaze dynamics, while deep-learning-based gaze prediction models lack biological plausibility. Vision Transformers (ViTs), which use self-attention mechanisms, offer an alternative, but when trained with conventional supervised learning, their attention patterns tend to be dispersed and unfocused. This study demonstrates that ViTs trained with self-supervised DINO (self-Distillation with NO labels) develop structured attention that closely aligns with the gaze behavior of humans viewing videos. Our analysis reveals that self-attention heads in later layers of DINO-trained ViTs autonomously differentiate into three distinct clusters: (1) G1 heads (20%), which focus on key points within figures (e.g., the eyes of the main character) and resemble human gaze; (2) G2 heads (60%), which distribute attention over entire figures with sharp contours (e.g., the bodies of all characters); and (3) G3 heads (20%), which primarily attend to the background. These findings provide insights into how human overt attention and figure-ground segregation emerge in visual perception. Our work suggests that self-supervised learning enables ViTs to develop attention mechanisms that are more aligned with biological vision than traditional supervised training.
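
To make the comparison between a head's attention and human gaze concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that a head's attention map is the softmax attention from the [CLS] query to the patch keys, and that alignment is scored as the attention at the patch containing the gaze point relative to a uniform map. The function names, the toy vectors, and the scoring rule are all hypothetical choices made for illustration.

```python
import math

def cls_attention(q_cls, keys):
    """Softmax attention of one head's [CLS] query over the patch keys."""
    d = len(q_cls)
    scores = [sum(q * k for q, k in zip(q_cls, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gaze_patch_index(gaze_xy, image_size, patch_size):
    """Map a gaze point (x, y) in pixels to a row-major patch index."""
    x, y = gaze_xy
    per_row = image_size // patch_size
    col = min(int(x) // patch_size, per_row - 1)
    row = min(int(y) // patch_size, per_row - 1)
    return row * per_row + col

def gaze_alignment(attn, gaze_idx):
    """Attention at the gaze patch relative to a uniform map (1/N).

    Values above 1 mean the head attends to the gazed patch more than
    chance. This score is an illustrative assumption, not the paper's metric.
    """
    return attn[gaze_idx] * len(attn)

# Toy example: a 32x32 image split into four 16x16 patches.
# The hypothetical head's [CLS] query matches only the last patch key.
keys = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
q_cls = [2.0, 0.0]
attn = cls_attention(q_cls, keys)
idx = gaze_patch_index((20, 20), image_size=32, patch_size=16)
score = gaze_alignment(attn, idx)
```

Under these assumptions, a head resembling the G1 cluster would score well above 1 at the gazed patch, while a head that spreads attention over the background would score near or below 1. Grouping heads into clusters would require a further step that is not sketched here.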

Authors

  • Takuto Yamamoto
    Department of Brain Physiology, Graduate School of Medicine, The University of Osaka, 1-3 Yamadaoka, Suita, Osaka, 565-0871, Japan.
  • Hirosato Akahoshi
    Dynamic Brain Network Laboratory, Graduate School of Frontier Biosciences, The University of Osaka, 1-3 Yamadaoka, Suita, Osaka, 565-0871, Japan.
  • Shigeru Kitazawa
    Department of Brain Physiology, Graduate School of Medicine, The University of Osaka, 1-3 Yamadaoka, Suita, Osaka, 565-0871, Japan; Dynamic Brain Network Laboratory, Graduate School of Frontier Biosciences, The University of Osaka, 1-3 Yamadaoka, Suita, Osaka, 565-0871, Japan; Center for Information and Neural Networks (CiNet), National Institute of Information and Communications Technology, 1-4 Yamadaoka, Suita, Osaka, 565-0871, Japan. Electronic address: kitazawa.shigeru.fbs@osaka-u.ac.jp.