Dense dynamic convolutional network for Bel canto vocal technique assessment.
Journal: Scientific Reports
PMID: 40325118
Abstract
Bel Canto performance is a complex, multidimensional art form encompassing pitch, timbre, technique, and affective expression. To accurately reflect a performer's singing proficiency, it is essential to quantify and evaluate their execution of vocal techniques precisely. Convolutional Neural Networks (CNNs), renowned for their ability to capture spatial hierarchical information, have been widely adopted in many tasks, including audio pattern recognition. However, existing CNNs exhibit limitations in extracting intricate spectral features, particularly those of Bel Canto performance. To address the challenges posed by complex spectral features and to meet the demand for objective vocal technique assessment, we introduce Omni-Dimensional Dynamic Convolution (ODConv). Additionally, we employ densely connected layers to optimize the framework, enabling efficient reuse of multi-scale features across multiple dynamic convolution layers. To validate the effectiveness of our method, we conducted experiments on vocal technique assessment, music classification, acoustic scene classification, and sound event detection. The results demonstrate that our Dense Dynamic Convolutional Network (DDNet) outperforms traditional CNN and Transformer models, achieving Top-1 accuracies of 90.11% (vocal technique assessment), 73.95% (music classification), and 89.31% (acoustic scene classification), and 41.89% mAP (sound event detection). Our research not only significantly improves the accuracy and efficiency of Bel Canto vocal technique assessment but also facilitates applications in vocal teaching and remote education.
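To make the core building block concrete, the sketch below illustrates the general idea behind dynamic convolution as used in ODConv-style layers: a bank of K candidate kernels is mixed per input sample by a learned attention, alongside an input-channel attention. This is a simplified PyTorch illustration written for this summary, not the authors' implementation; the module name `ODConv2d` and all hyperparameters here are illustrative assumptions, and the full ODConv design also attends over spatial and output-channel dimensions, which this reduction omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2d(nn.Module):
    """Simplified dynamic convolution sketch (illustrative, not the paper's code).

    Holds K candidate kernels and mixes them per sample via a softmax
    attention over the kernel dimension; an additional sigmoid attention
    reweights input channels. The full ODConv also attends over spatial
    positions and output channels, omitted here for brevity.
    """

    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4, reduction=4):
        super().__init__()
        self.num_kernels = num_kernels
        self.kernel_size = kernel_size
        # Bank of K candidate convolution kernels: (K, out_ch, in_ch, k, k)
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        hidden = max(in_ch // reduction, 4)
        self.pool = nn.AdaptiveAvgPool2d(1)          # global context from the input
        self.fc = nn.Linear(in_ch, hidden)
        self.kernel_att = nn.Linear(hidden, num_kernels)  # attention over kernels
        self.channel_att = nn.Linear(hidden, in_ch)       # attention over input channels

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = F.relu(self.fc(self.pool(x).flatten(1)))     # (b, hidden)
        a_kernel = F.softmax(self.kernel_att(ctx), dim=1)  # (b, K)
        a_chan = torch.sigmoid(self.channel_att(ctx))      # (b, c)
        x = x * a_chan.view(b, c, 1, 1)                    # apply channel attention
        # Mix the K kernels per sample: (b, out_ch, in_ch, k, k)
        w_mix = torch.einsum('bk,koihw->boihw', a_kernel, self.weight)
        # Run all samples at once as a grouped convolution (one group per sample)
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       w_mix.reshape(-1, c, self.kernel_size, self.kernel_size),
                       groups=b, padding=self.kernel_size // 2)
        return out.reshape(b, -1, h, w)
```

The grouped-convolution trick lets each sample in the batch use its own mixed kernel in a single `F.conv2d` call; with dense connectivity, the outputs of successive such layers would additionally be concatenated along the channel axis, as in DenseNet-style blocks.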