Emergence of human-like attention and distinct head clusters in self-supervised vision transformers: A comparative eye-tracking study.
Journal:
Neural Networks: the official journal of the International Neural Network Society
Published Date:
May 21, 2025
Abstract
Visual attention models aim to predict human gaze behavior, yet traditional saliency models and deep gaze prediction networks face limitations. Saliency models rely on handcrafted low-level visual features, often failing to capture human gaze dynamics, while deep learning-based gaze prediction models lack biological plausibility. Vision Transformers (ViTs), which use self-attention mechanisms, offer an alternative, but when trained with conventional supervised learning, their attention patterns tend to be dispersed and unfocused. This study demonstrates that ViTs trained with self-supervised DINO (self-Distillation with NO labels) develop structured attention that closely aligns with human gaze behavior when viewing videos. Our analysis reveals that self-attention heads in later layers of DINO-trained ViTs autonomously differentiate into three distinct clusters: (1) G1 heads (20%), which focus on key points within figures (e.g., the eyes of the main character) and resemble human gaze; (2) G2 heads (60%), which distribute attention over entire figures with sharp contours (e.g., the bodies of all characters); and (3) G3 heads (20%), which primarily attend to the background. These findings provide insights into how human overt attention and figure-ground segregation emerge in visual perception. Our work suggests that self-supervised learning enables ViTs to develop attention mechanisms that are more aligned with biological vision than traditional supervised training.
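The head-grouping analysis described above can be sketched with a toy example. The code below is not the paper's pipeline; it is a minimal, synthetic illustration of the idea that per-head CLS attention maps can be separated into focused (G1), figure-wide (G2), and background (G3) groups using simple features such as attention entropy and the fraction of attention mass falling on a figure region. The head counts, thresholds, mask, and map shapes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-head CLS attention maps: 10 heads over a 14x14 patch grid.
n_heads, grid = 10, 14
maps = rng.random((n_heads, grid, grid))

# Hypothetical "figure" mask occupying the image centre.
figure = np.zeros((grid, grid), dtype=bool)
figure[4:10, 4:10] = True

# Simulate the three head types from the abstract (purely synthetic):
for h in range(2):           # G1-like: sharply peaked on one key point
    maps[h] *= 0.001
    maps[h, 6, 6] = 5.0
for h in range(2, 8):        # G2-like: spread over the whole figure
    maps[h] *= figure
for h in range(8, 10):       # G3-like: attention on the background
    maps[h] *= ~figure

# Normalise each head's map so it sums to 1, like a softmax output.
maps /= maps.sum(axis=(1, 2), keepdims=True)

# Two simple per-head features:
# - entropy: low for focused heads, high for diffuse ones
# - fig_frac: share of attention mass inside the figure mask
entropy = -(maps * np.log(maps + 1e-12)).sum(axis=(1, 2))
fig_frac = maps[:, figure].sum(axis=1)

# Rule-based grouping (a stand-in for the paper's clustering):
groups = np.where(entropy < 2.0, "G1",
                  np.where(fig_frac > 0.5, "G2", "G3"))
for g in ("G1", "G2", "G3"):
    print(g, int((groups == g).sum()))
```

With the synthetic heads constructed this way, the grouping recovers a 2/6/2 split, mirroring the roughly 20%/60%/20% proportions reported in the abstract. In the actual study the features and clustering method may differ; the point is only that focus and figure/background coverage suffice to separate such head types.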