Inferring crowd crush accidents in typical high-density pedestrian movement zones via the vision-trajectory fusion neural network.
Journal:
Accident; analysis and prevention
Published Date:
Mar 2, 2026
Abstract
Walking is an indispensable mode of daily travel, yet recurrent high-density crowd gatherings in pedestrian facilities, e.g., holiday surges at railway stations, are widely recognized as high-risk factors that can precipitate crowd crush accidents. To facilitate effective prevention, this study provides a foundational step by enabling precise inference of spatio-temporal crowd evolution characteristics in representative high-density movement zones via accurate pedestrian trajectory prediction. Technically, we develop a dual-modal data-driven framework, the Vision-Trajectory Cross Fusion Crowd Prediction Neural Network (VT-CrowdNet), which integrates an enhanced graph-learning module for structured trajectory modeling, an extended vision transformer for feature extraction from bird's-eye surveillance imagery, and an adapted cross-attention residual fusion mechanism for dual-modal integration. Together, these components enable high-fidelity forecasting of individual trajectories, which underpins the subsequent multi-metric accident inference. This inference uses region-scale indicators that capture crowd danger and transit efficiency, as well as global-scale metrics that reflect collective movement fluidity. Experiments first evaluate VT-CrowdNet, alongside classical and state-of-the-art baselines, on microscopic trajectory prediction across representative high-density crowd movement zones: unidirectional corridors, bidirectional corridors, and a four-directional intersection. After VT-CrowdNet's performance is validated at the microscopic level, the most complex zone, the four-directional intersection, which is characterized by frequent and heterogeneous pedestrian interactions, is selected to assess the model's capability to infer spatio-temporal patterns of crowd crush risk at both regional and global scales.
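The cross-attention residual fusion step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name, feature shapes, single attention head, and random projection matrices are all assumptions made for illustration. It shows only the general pattern in which per-pedestrian trajectory features act as queries over visual patch features and the attended result is added back as a residual.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_residual_fusion(traj_feats, vis_feats, w_q, w_k, w_v):
    """Hypothetical single-head fusion (illustrative only).

    traj_feats: (N, d) features, one row per pedestrian.
    vis_feats:  (P, d) features, one row per image patch.
    Returns the fused (N, d) features and the (N, P) attention weights.
    """
    q = traj_feats @ w_q                      # queries from the trajectory branch
    k = vis_feats @ w_k                       # keys from the vision branch
    v = vis_feats @ w_v                       # values from the vision branch
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores, (N, P)
    attn = softmax(scores, axis=-1)           # each pedestrian attends over patches
    fused = traj_feats + attn @ v             # residual connection keeps trajectory signal
    return fused, attn

# Toy inputs: 5 pedestrians, 16 image patches, feature width 8 (all assumed).
rng = np.random.default_rng(0)
n_ped, n_patch, d = 5, 16, 8
traj = rng.normal(size=(n_ped, d))
vis = rng.normal(size=(n_patch, d))
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

fused, attn = cross_attention_residual_fusion(traj, vis, w_q, w_k, w_v)
```

Because of the residual path, the fused features reduce to the trajectory features whenever the visual values contribute nothing, so in this sketch the vision branch can only add to the trajectory representation rather than replace it.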
Results demonstrate that VT-CrowdNet not only achieves superior microscopic trajectory prediction across all movement zones but also consistently leads in inferring crowd-crush-related spatio-temporal characteristics. Two key insights emerge: i) under a dual-modal framework, performance improves consistently only when imagery and structured data are effectively aligned, whereas modal misalignment can degrade accuracy; ii) relatively accurate trajectory prediction does not necessarily ensure reliable inference of crowd crush characteristics, and erroneous outputs may mislead the management of and intervention in pedestrian traffic flow.
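Insight ii) can be made concrete with a toy calculation. The sketch below is an illustration under stated assumptions, not the paper's evaluation: average displacement error and rectangular-region density are stand-in metrics chosen here (the abstract does not name its indicators), and the positions are synthetic. It shows how a small, uniform trajectory error can still move a whole group across a region boundary and erase the predicted density peak.

```python
import numpy as np

def average_displacement_error(pred, true):
    """Mean Euclidean distance between predicted and true positions.

    pred, true: (T, N, 2) arrays of positions in metres.
    """
    return float(np.linalg.norm(pred - true, axis=-1).mean())

def region_density(positions, x_range, y_range):
    """Pedestrians per square metre inside a rectangle, per time step.

    positions: (T, N, 2) array; returns an array of shape (T,).
    """
    x, y = positions[..., 0], positions[..., 1]
    inside = (
        (x >= x_range[0]) & (x < x_range[1])
        & (y >= y_range[0]) & (y < y_range[1])
    )
    area = (x_range[1] - x_range[0]) * (y_range[1] - y_range[0])
    return inside.sum(axis=1) / area

# Synthetic ground truth: 8 pedestrians standing just inside a 1 m x 1 m cell.
one_step = np.stack([np.full(8, 0.9), np.linspace(0.1, 0.9, 8)], axis=1)  # (8, 2)
true = np.tile(one_step, (3, 1, 1))                                       # (3, 8, 2)

# Synthetic prediction: every position is off by 0.2 m in x.
pred = true + np.array([0.2, 0.0])

ade = average_displacement_error(pred, true)
true_density = region_density(true, (0.0, 1.0), (0.0, 1.0))
pred_density = region_density(pred, (0.0, 1.0), (0.0, 1.0))
```

In this toy case the displacement error is only 0.2 m, yet the true density in the cell is 8 pedestrians per square metre while the predicted density is zero, so a risk indicator computed from the prediction would miss the congestion entirely.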