Mask-RadarNet: Enhancing Transformer With Spatial-Temporal Semantic Context for Radar Object Detection in Autonomous Driving
Journal: arXiv
Published Date: Dec 20, 2024
Abstract
As a cost-effective and robust technology, automotive radar has improved
steadily in recent years, making it an appealing complement to commonly used
sensors such as cameras and LiDAR in autonomous driving. Radio
frequency data with rich semantic information are attracting more and more
attention. Most current radar-based models take radio frequency image sequences
as the input. However, these models rely heavily on convolutional neural
networks and neglect the spatial-temporal semantic context during the
encoding stage. To address these limitations, we propose a model called
Mask-RadarNet to fully utilize the hierarchical semantic features from the
input radar data. Mask-RadarNet exploits the combination of interleaved
convolution and attention operations to replace the traditional architecture in
transformer-based models. In addition, patch shift is introduced into
Mask-RadarNet for efficient spatial-temporal feature learning. By shifting a
subset of patches with a specific mosaic pattern in the temporal dimension,
Mask-RadarNet achieves competitive performance while reducing the computational
burden of the spatial-temporal modeling. In order to capture the
spatial-temporal semantic contextual information, we design the class masking
attention module (CMAM) in our encoder. Moreover, a lightweight auxiliary
decoder is added to our model to aggregate prior maps generated from the CMAM.
Experiments on the CRUW dataset demonstrate the superiority of the proposed
method over several state-of-the-art radar-based object detection algorithms. With
lower computational complexity and fewer parameters, the proposed
Mask-RadarNet achieves higher recognition accuracy for object detection in
autonomous driving.
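
The patch shift idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the mosaic pattern, the boundary handling, or the tensor layout, so the 2x2 offset pattern, the circular wrap-around, and the names `MOSAIC` and `patch_shift` below are all assumptions made for illustration, not the authors' implementation. The sketch only shows the general mechanism: some patches in each frame are replaced by the patches at the same spatial position in neighboring frames, so that a subsequent per-frame (spatial) attention step sees information from several time steps without a full spatial-temporal attention.

```python
# Hypothetical sketch of a temporal patch shift. The pattern and all names
# here are illustrative assumptions, not taken from the paper.

# Assumed 2x2 mosaic of temporal offsets, tiled over the patch grid:
#   0 keeps the patch from its own frame,
#  -1 takes the patch from the previous frame,
#  +1 takes the patch from the next frame.
MOSAIC = [
    [0, -1],
    [1, 0],
]


def patch_shift(frames, mosaic=MOSAIC):
    """Shift patches along time according to a tiled mosaic of offsets.

    frames: list of T frames, each an H x W grid (list of lists) of patch
    features. Returns a new T x H x W structure; the input is not modified.
    Temporal indices wrap around (circular shift), which is one possible
    boundary choice (zero padding would be another).
    """
    num_frames = len(frames)
    height = len(frames[0])
    width = len(frames[0][0])
    rows = len(mosaic)
    cols = len(mosaic[0])

    shifted = []
    for t in range(num_frames):
        frame = []
        for i in range(height):
            row = []
            for j in range(width):
                # Look up the temporal offset for this spatial position.
                offset = mosaic[i % rows][j % cols]
                source_t = (t + offset) % num_frames
                row.append(frames[source_t][i][j])
            frame.append(row)
        shifted.append(frame)
    return shifted


# Three 2x2 frames whose patch "features" are labelled strings, so the
# origin of each patch after shifting is easy to read off.
frames = [[[f"t{t}p{i}{j}" for j in range(2)] for i in range(2)] for t in range(3)]
shifted = patch_shift(frames)
print(shifted[1])
```

In this toy example, the middle frame after shifting contains two of its own patches plus one patch from the previous frame and one from the next, so attention computed within that single frame already mixes three time steps.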