A Transformer-based Multimodal Fusion Model for Efficient Crowd Counting Using Visual and Wireless Signals
Journal: arXiv
Published Date: Apr 28, 2025
Abstract
Current crowd-counting models often rely on a single input modality, such as
visual images or wireless signal data, which can result in significant
information loss and suboptimal counting performance. To address these
shortcomings, we propose TransFusion, a novel multimodal fusion-based
crowd-counting model that integrates Channel State Information (CSI) with image
data. Using a Transformer network, TransFusion fuses these two distinct data
modalities and captures the global contextual information that is critical for
accurate crowd estimation. However, while Transformers capture global features
well, they can miss the fine-grained local details essential for precise crowd
counting. To mitigate this, we
incorporate Convolutional Neural Networks (CNNs) into the model architecture,
enhancing its ability to extract detailed local features that complement the
global context provided by the Transformer. Extensive experimental evaluations
demonstrate that TransFusion achieves high counting accuracy while maintaining
superior efficiency.
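
The fusion idea described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the TransFusion architecture: the token shapes, the smoothing kernel, the identity query/key/value projections, and the names `local_features`, `self_attention`, `fuse_and_count`, and `head_weights` are all assumptions made for illustration. A small convolution stands in for the CNN branch (local detail), and one round of self-attention over the concatenated image and CSI tokens stands in for the Transformer (global context).

```python
import math

def conv1d_same(seq, kernel):
    """1-D convolution with zero padding; output has the same length as seq."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(seq))]

def local_features(tokens, kernel):
    """Stand-in for the CNN branch: convolve along the token axis, per channel."""
    n, d = len(tokens), len(tokens[0])
    channels = [conv1d_same([tok[c] for tok in tokens], kernel) for c in range(d)]
    return [[channels[c][i] for c in range(d)] for i in range(n)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which keeps the sketch free of learned weights.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

def fuse_and_count(image_tokens, csi_tokens, head_weights,
                   kernel=(0.25, 0.5, 0.25)):
    """Fuse two modalities and regress a scalar count (hypothetical head)."""
    # Local detail: each modality is filtered separately.
    img = local_features(image_tokens, kernel)
    csi = local_features(csi_tokens, kernel)
    # Global context: every token attends to tokens from both modalities.
    attended = self_attention(img + csi)
    # Mean-pool the attended tokens, then apply a linear regression head.
    d = len(attended[0])
    pooled = [sum(tok[i] for tok in attended) / len(attended) for i in range(d)]
    return sum(w * p for w, p in zip(head_weights, pooled))

if __name__ == "__main__":
    # Toy inputs: 3 image tokens and 2 CSI tokens, each with 2 features.
    image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    csi_tokens = [[0.5, 0.5], [0.2, 0.8]]
    print(fuse_and_count(image_tokens, csi_tokens, head_weights=[1.0, 1.0]))
```

In a trained model, the kernel, the attention projections, and the regression head would be learned, and the two modalities would first be embedded to a common feature width so their tokens can be concatenated, as the toy inputs here already are.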