Evaluating Mandarin tone pronunciation accuracy for second language learners using a ResNet-based Siamese network.
Journal: Scientific Reports
Published Date: Jul 8, 2025
Abstract
Evaluating tone pronunciation is essential for helping second-language (L2) learners master the intricate nuances of Mandarin tones. This article introduces an innovative automatic evaluation method for Mandarin tone pronunciation that employs a Siamese network (SN), which integrates two branch networks with a modified architecture specifically designed for large-scale image recognition tasks. We compiled a specialized corpus utilizing open-access and meticulously curated Mandarin corpora to develop our model, including standard-accented and non-standard-accented Mandarin speech. We extracted the pitch contour for each Mandarin syllable and applied Local Weighted Regression to smooth it. The resulting smooth pitch contour was normalized on a scale from 0 to 5, adhering to a five-level tone scale. We identified two key features from the normalized pitch contour: a 40D vector (1D feature) and a [Formula: see text] binary pixel image (2D feature), effectively capturing each syllable's tonal characteristics. During the training phase, the SN was trained using paired tone features from two syllables and a label indicating whether their tones matched. This setup allowed the network to assess discrepancies between the paired tones, accurately identifying tone pronunciation errors relative to standard-accented Mandarin syllables. In the testing phase, we input the tone features of two syllables into the SN to evaluate the degree of discrepancy between their tones. To ensure the reliability of our approach, we conducted experiments with several models, including ResNet-18, VGG-16, AlexNet, and a custom-designed baseline. We evaluated the 1D and 2D features through a series of specially designed subjective and objective assessments to measure our model's effectiveness in predicting tone discrepancies. The results from our experiments across various models demonstrate that our proposed method effectively assesses tone discrepancies. 
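The feature pipeline described above (normalized pitch contour, then a 40D vector and a binary pixel image) can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the contour is taken to be already smoothed, the 0 to 5 normalization uses a log-frequency mapping between a speaker's minimum and maximum pitch, and the image side length `size` is a hypothetical parameter, since the abstract does not state the image dimensions. The function names `normalize_contour`, `to_vector`, and `to_binary_image` are likewise hypothetical.

```python
import numpy as np

def normalize_contour(f0_hz, f_min, f_max):
    """Map pitch values in Hz onto a 0-5 scale.

    Assumption: a log-frequency mapping between the speaker's pitch floor
    and ceiling; the abstract does not specify the exact formula.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    scaled = 5.0 * (np.log(f0) - np.log(f_min)) / (np.log(f_max) - np.log(f_min))
    return np.clip(scaled, 0.0, 5.0)

def to_vector(contour, n_points=40):
    """Resample a contour to a fixed-length vector by linear interpolation."""
    contour = np.asarray(contour, dtype=float)
    src = np.linspace(0.0, 1.0, len(contour))
    dst = np.linspace(0.0, 1.0, n_points)
    return np.interp(dst, src, contour)

def to_binary_image(vector, size=40):
    """Rasterize a 0-5 contour into a size x size binary image.

    Assumption: one lit pixel per column, with row 0 at the top (level 5).
    """
    vector = np.asarray(vector, dtype=float)
    cols = np.round(np.linspace(0, len(vector) - 1, size)).astype(int)
    rows = np.round((1.0 - vector[cols] / 5.0) * (size - 1)).astype(int)
    image = np.zeros((size, size), dtype=np.uint8)
    image[rows, np.arange(size)] = 1
    return image

# Synthetic falling contour (illustrative data, not from the corpus).
f0 = np.linspace(220.0, 110.0, 25)
vec = to_vector(normalize_contour(f0, f_min=110.0, f_max=220.0))
img = to_binary_image(vec)
print(vec.shape, img.shape, int(img.sum()))
```

In this sketch the 1D feature keeps the exact normalized pitch values, while the 2D feature trades precision for a shape representation that image-recognition backbones can consume directly.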
The versatility of our approach is highlighted by the compatibility of both the 1D and 2D features with multiple models, with the 2D features showing exceptional consistency when paired with ResNet-18. In subjective evaluations, our model achieved a Mean Squared Error (MSE) of 2.295 and a Root Mean Squared Error (RMSE) of 1.515 compared to expert assessments. We recorded an MSE of 0.189 and an RMSE of 0.435 in objective evaluations. ResNet-18 exhibited remarkable stability and effectiveness when integrated with 2D features, laying a solid foundation for future research in tone evaluation aimed at Mandarin L2 learners.
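As a quick consistency check on the reported metrics, RMSE is the square root of MSE, so each reported pair should agree to rounding. The short sketch below defines both metrics and checks the two pairs quoted above. The helper names `mse` and `rmse` and the toy score arrays are illustrative assumptions, not taken from the paper.

```python
import math
import numpy as np

def mse(predicted, reference):
    """Mean squared error between model scores and reference scores."""
    p = np.asarray(predicted, dtype=float)
    r = np.asarray(reference, dtype=float)
    return float(np.mean((p - r) ** 2))

def rmse(predicted, reference):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(predicted, reference))

# Reported (MSE, RMSE) pairs from the abstract.
reported = {"subjective": (2.295, 1.515), "objective": (0.189, 0.435)}
for name, (m, r) in reported.items():
    # The square root of each reported MSE should round to the reported RMSE.
    print(name, round(math.sqrt(m), 3), r)

# Toy example with made-up scores, only to show how the helpers are called.
toy_mse = mse([1, 2, 3], [1, 2, 5])
toy_rmse = rmse([1, 2, 3], [1, 2, 5])
```

Both reported pairs are internally consistent under this check.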