A multi-scale feature fusion gaze estimation model based on convolutional neural network and vision transformer.

Journal: Scientific Reports
Published Date:

Abstract

To address ineffective feature fusion and feature loss in gaze estimation under unconstrained environments, this study proposes a multi-scale feature fusion model, CAF-ViT (Cross-Attention Fusion Vision Transformer). The model takes multi-scale face images as input, uses ResNet-18 to extract feature maps at different granularities, and introduces learnable Class Tokens per scale. In the fusion stage, Class Tokens of different scales first perform self-attention in their respective Transformer Encoders to aggregate local details and global semantics. Then, by swapping the token sequences and computing cross-attention, the model achieves bidirectional interaction and deep fusion of coarse- and fine-grained features. To further refine feature representation, an additional attention layer is added after cross-attention. It linearly transforms the original query vector with Sigmoid activation to generate new query weights, and linearly maps the attention output to new value vectors, improving the representation of task-relevant features. The fused Class Token is finally regressed to gaze direction via a multilayer perceptron. Experiments show estimation errors of [Formula: see text] on MPIIFaceGaze, representing an 8.6% improvement over the baseline hybrid CNN-Transformer, and errors of [Formula: see text] on EyeDiap and [Formula: see text] on Gaze360. Ablation studies validate the multi-scale fusion strategy and improved attention mechanism.
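
The fusion stage described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes single-head attention, a token width of 32, random matrices in place of learned weights, a residual connection around the refinement step, an element-wise product between the Sigmoid gate and the remapped attention output, and a (pitch, yaw) output for the MLP head. None of these details is stated in the abstract. The ResNet-18 backbone and the per-scale Transformer Encoders are omitted, and all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # token width (assumed; the abstract does not give it)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def weights(d_in, d_out):
    # Random matrices stand in for learned projection weights.
    return rng.normal(0.0, d_in ** -0.5, size=(d_in, d_out))


def cross_attention(cls_tok, other_patches, w_q, w_k, w_v):
    """One scale's class token (1, D) attends over the other scale's patches."""
    seq = np.concatenate([cls_tok, other_patches], axis=0)  # swapped sequence
    q, k, v = cls_tok @ w_q, seq @ w_k, seq @ w_v
    attn = softmax(q @ k.T / np.sqrt(D))  # (1, 1 + n) weights, row sums to 1
    return q, attn @ v


def gated_refine(q, attn_out, w_gate, w_val):
    """Sigmoid-gated refinement; the element-wise product is an assumption."""
    gate = sigmoid(q @ w_gate)   # "new query weights", each in (0, 1)
    value = attn_out @ w_val     # "new value vectors"
    return gate * value


def fuse(coarse, fine, p):
    """coarse/fine: (1 + n, D) token sequences with the class token in row 0."""
    c_q, c_att = cross_attention(coarse[:1], fine[1:], *p["c_attn"])
    f_q, f_att = cross_attention(fine[:1], coarse[1:], *p["f_attn"])
    c_cls = coarse[:1] + gated_refine(c_q, c_att, *p["c_gate"])  # residual add
    f_cls = fine[:1] + gated_refine(f_q, f_att, *p["f_gate"])
    return np.concatenate([c_cls, f_cls], axis=-1)  # fused token, (1, 2 * D)


def mlp_head(x, w1, w2):
    """Two-layer MLP regressing an assumed (pitch, yaw) pair."""
    return np.maximum(x @ w1, 0.0) @ w2


params = {
    "c_attn": [weights(D, D) for _ in range(3)],
    "f_attn": [weights(D, D) for _ in range(3)],
    "c_gate": [weights(D, D) for _ in range(2)],
    "f_gate": [weights(D, D) for _ in range(2)],
}
coarse = rng.normal(size=(1 + 16, D))  # class token + 16 coarse patches (made up)
fine = rng.normal(size=(1 + 49, D))    # class token + 49 fine patches (made up)

fused = fuse(coarse, fine, params)
gaze = mlp_head(fused, weights(2 * D, D), weights(D, 2))
print(fused.shape, gaze.shape)  # (1, 64) (1, 2)
```

Because only the class token of each scale acts as the query, the cost of each cross-attention call grows linearly with the number of patch tokens in the other scale, which is one plausible reason for exchanging class tokens rather than attending between full sequences.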
