Hierarchical query design and distributed attention in transformer for player group activity recognition in sports analysis.
Journal:
Scientific Reports
Published Date:
Aug 27, 2025
Abstract
Group activity recognition in sports analysis is a challenging computer vision problem that requires robust modeling of complex player interactions and dynamic scenarios. Existing approaches predominantly rely on region-based features and two-stage pipelines that first localize individuals and then classify activities. These methods depend on accurate bounding box detection and often struggle with feature entanglement, occlusions, and the integration of broader contextual information. To address these limitations, this study introduces a hierarchical query design and a distributed attention framework within a transformer architecture, tailored specifically to player group activity recognition in sports. The proposed model, the hierarchical attention query transformer (HAQT), uses a novel dual-pathway architecture to decouple individual and group activity recognition. The hierarchical query design disentangles individual-level and group-level features, while the distributed attention mechanism refines communication within and across player groups. Additionally, a deformable transformer backbone dynamically aggregates multi-scale spatiotemporal features, improving the model's robustness to occlusions, variable player formations, and motion dynamics. The proposed set prediction paradigm eliminates reliance on bounding box accuracy while enabling precise player localization and activity classification. Comprehensive experiments on the Volleyball and Basketball-51 datasets validate the effectiveness of HAQT. On the Volleyball dataset, HAQT achieves a state-of-the-art mean Average Precision (mAP) of 92.8% for group activity recognition, surpassing existing models. On the Basketball-51 dataset, it achieves an accuracy of 92.76%, demonstrating its ability to model complex spatiotemporal dependencies.
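
The abstract does not spell out how group-level and player-level queries exchange information, so the following is only an illustrative sketch of one way "communication within and across player groups" could be expressed as an attention mask. The query layout, the function name `build_attention_mask`, and the specific masking rule are all assumptions made for illustration and are not taken from the paper.

```python
def build_attention_mask(num_groups, players_per_group):
    """Return allowed[i][j]: may query i attend to query j?

    Assumed (hypothetical) layout: the first `num_groups` slots are
    group-level queries; the remaining slots are player-level queries,
    stored contiguously group by group.
    """
    n = num_groups + num_groups * players_per_group

    def group_of(idx):
        # A group query belongs to its own group; a player query
        # belongs to the group whose block of slots it sits in.
        if idx < num_groups:
            return idx
        return (idx - num_groups) // players_per_group

    allowed = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            both_group_level = i < num_groups and j < num_groups
            if both_group_level:
                # Across groups: group-level queries see each other.
                allowed[i][j] = True
            else:
                # Within a group: players and their group query interact.
                allowed[i][j] = group_of(i) == group_of(j)
    return allowed


mask = build_attention_mask(num_groups=2, players_per_group=3)
```

Under this assumed rule, player queries never attend directly to players of another group; cross-group context reaches them only through the group-level queries, which is one simple way to keep the two levels of features separated.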