Scene-dependent sound event detection based on multitask learning with deformable large kernel attention convolution.
Journal:
PLoS ONE
PMID:
40343957
Abstract
Sound event detection (SED) and acoustic scene classification (ASC) are closely related tasks in environmental sound analysis. Given the interrelationship between sound events and scenes, some previous studies have proposed multitask learning (MTL) methods to analyze SED and ASC jointly. However, these MTL methods are generally based on hard parameter sharing, which exchanges sound event and scene features only through the low-level layers of the network. Such approaches struggle to balance the complex interrelationships between SED and ASC and limit feature sharing and information flow between the tasks during training. To address these challenges, this study proposes a novel multitask network based on a residual multi-level feature extraction (R-MFE) framework, which jointly analyzes the SED and ASC tasks and exploits scene information to improve sound event detection performance. In addition, this study designs the D-LKAC attention module, which combines the advantages of self-attention mechanisms and convolution to capture both global and local features. To further enhance SED performance, this study introduces the MS-conv module, which captures audio details across multiple dimensions. The proposed MTL method is evaluated on the TUT Acoustic Scenes 2016/2017 and TUT Sound Events 2016/2017 datasets. Experimental results indicate that our approach outperforms state-of-the-art techniques, improving the F-score by 6.44%.
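To illustrate the hard parameter-sharing baseline the abstract contrasts against, the following is a minimal NumPy sketch of a multitask model in which one shared low-level encoder feeds both an SED head (frame-level, multi-label) and an ASC head (clip-level, single-label). All layer sizes, weight initializations, and variable names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical hard parameter-sharing MTL sketch: both tasks exchange
# features only through one shared low-level encoder (the limitation
# the R-MFE framework in the abstract is designed to overcome).

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_frames, n_mels = 500, 64   # one clip's log-mel spectrogram (assumed shape)
n_events, n_scenes = 6, 4    # illustrative label counts

x = rng.standard_normal((n_frames, n_mels))

# Shared low-level encoder: the only place the two tasks share features.
W_shared = rng.standard_normal((n_mels, 128)) * 0.1
h = relu(x @ W_shared)                          # (n_frames, 128)

# SED head: per-frame, multi-label event activity probabilities.
W_sed = rng.standard_normal((128, n_events)) * 0.1
event_probs = sigmoid(h @ W_sed)                # (n_frames, n_events)

# ASC head: average-pool over time, then single-label scene posterior.
W_asc = rng.standard_normal((128, n_scenes)) * 0.1
scene_probs = softmax(h.mean(axis=0) @ W_asc)   # (n_scenes,)

print(event_probs.shape, scene_probs.shape)
```

In training, the SED head would receive a per-frame binary cross-entropy loss and the ASC head a clip-level cross-entropy loss, with gradients from both tasks updating only `W_shared` jointly; deeper task-specific layers never exchange information, which is the restriction the proposed R-MFE network relaxes.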