Enhancing realism in LiDAR scene generation with CSPA-DFN and linear cross-attention via Diffusion Transformer model.

Journal: Neural networks : the official journal of the International Neural Network Society
Published Date:

Abstract

Point cloud diffusion models have found extensive applications in autonomous driving and robotics. However, a substantial gap in visual quality remains between the LiDAR scene samples they generate and real-world data. This discrepancy arises primarily from the loss of detailed information when decoding from latent space and from the lack of guidance from global 3D structural information during point cloud generation, which leads to distortions and artifacts in LiDAR scene samples. In this paper, we propose a novel LiDAR Diffusion Transformer model that integrates a Channel-Spatial Parallel Attention and Dilation Fusion Network (CSPA-DFN) with a linear cross-attention post-processing module to refine the generated LiDAR scene samples. Specifically, CSPA-DFN emphasizes detailed features across channels and spatial locations in parallel, leveraging multi-scale dilated convolutions and channel grouping to preserve and enhance these features. To provide global 3D structural information while balancing performance and efficiency, we design a post-processing module that fuses voxelized features and range images using a linear ReLU cross-attention mechanism. Our approach is evaluated on the unconditional generation task using the KITTI-360 and nuScenes datasets, achieving state-of-the-art results in LiDAR scene generation quality. Furthermore, incorporating semantic labels and camera views into the latent space enhances the model's semantic understanding of LiDAR scenes and yields additional improvements in visual quality over previous works. The code implementation has been released at https://github.com/HITysx/LiDAR-Scene-Generation.
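
As a rough illustration of the linear ReLU cross-attention idea mentioned in the abstract, the sketch below shows why a ReLU feature map lets attention cost grow linearly with the number of tokens. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the choice of range-image features as queries and voxel features as keys and values, the sum-based normalization, and the `eps` constant are all hypothetical. The key step is that the key-value summary is computed once and reused for every query, so no N x M attention matrix is ever formed.

```python
def relu(matrix):
    """Element-wise ReLU on a list-of-lists matrix."""
    return [[max(0.0, v) for v in row] for row in matrix]


def linear_relu_cross_attention(queries, keys, values, eps=1e-6):
    """Linear attention with a ReLU feature map (illustrative only).

    queries: N x d   (assumed here: range-image tokens)
    keys:    M x d   (assumed here: voxel tokens)
    values:  M x d_v (assumed here: voxel tokens)
    Cost is O((N + M) * d * d_v) instead of O(N * M * d).
    """
    q, k = relu(queries), relu(keys)
    d, d_v = len(k[0]), len(values[0])

    # Summarize all keys and values once: kv is d x d_v, k_sum has length d.
    kv = [[0.0] * d_v for _ in range(d)]
    k_sum = [0.0] * d
    for k_row, v_row in zip(k, values):
        for a in range(d):
            k_sum[a] += k_row[a]
            for b in range(d_v):
                kv[a][b] += k_row[a] * v_row[b]

    # Each query reads from the shared summary.
    out = []
    for q_row in q:
        denom = sum(q_row[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(q_row[a] * kv[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out


def quadratic_reference(queries, keys, values, eps=1e-6):
    """Same result computed the explicit N x M way, for comparison."""
    q, k = relu(queries), relu(keys)
    d_v = len(values[0])
    out = []
    for q_row in q:
        weights = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        denom = sum(weights) + eps
        out.append([sum(w * v[b] for w, v in zip(weights, values)) / denom
                    for b in range(d_v)])
    return out


if __name__ == "__main__":
    queries = [[1.0, -2.0], [0.5, 3.0]]
    keys = [[1.0, 0.0], [-1.0, 2.0], [0.5, 0.5]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(linear_relu_cross_attention(queries, keys, values))
    print(quadratic_reference(queries, keys, values))
```

Both functions compute the same weighted average; the linear form only reorders the multiplications, which is possible because ReLU similarity, unlike softmax, factorizes into separate query and key terms.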

Authors

  • Shaoxun Ye
    Control and Simulation Center, Harbin Institute of Technology, Harbin 150080, China; National Key Laboratory of Modeling and Simulation for Complex Systems, Harbin 150080, China.
  • Xiaoguang Di
    Control and Simulation Center, Harbin Institute of Technology, Harbin 150080, China; National Key Laboratory of Modeling and Simulation for Complex Systems, Harbin 150080, China. Electronic address: dixiaoguang@hit.edu.cn.
  • Ming Liao
    Center for Genomic and Personalized Medicine, Guangxi Medical University, Nanning, Guangxi, China.
  • Ximing Li
    Tianjin Cardiovascular Institute, Tianjin Chest Hospital, Tianjin, China.