Benchmarking Automatic Speech Recognition Technology for Natural Language Samples of Children With and Without Developmental Delays.
Journal:
Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)
PMID:
40039537
Abstract
Natural language sampling (NLS) offers rich insights into real-world speech and language use across diverse groups; yet human transcription is time-consuming and costly. Automatic speech recognition (ASR) technology has the potential to revolutionize NLS research. However, its performance in clinical-research settings with young children and those with developmental delays remains unknown. This study evaluates the OpenAI Whisper ASR model on n=34 NLS sessions of toddlers with and without language delays. Manual comparison of ASR output to human transcriptions of children with Down syndrome (DS; n=19; 2-5 years old) and typically developing children (TD; n=15; 2-3 years old) revealed that ASR accurately captured 50% of words spoken by TD children but only 14% for those with DS. About 20% of words were missed in both groups, and 21% (TD) and 6% (DS) of words were replaced. ASR also struggled with developmentally informative sounds, such as non-speech vocalizations, missing almost 50% in the DS data and misinterpreting most of the rest. While ASR shows potential to reduce transcription time, its limitations underscore the need for human-in-the-loop clinical machine learning systems, especially for underrepresented groups.
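The word-level categories reported above (captured, missed, replaced) correspond to matches, deletions, and substitutions in an alignment between the human reference transcript and the ASR hypothesis. The paper's exact comparison procedure was manual; as an illustrative sketch only, such categories could be computed automatically with a standard sequence alignment, here using Python's stdlib `difflib` (the function name and normalization choices are assumptions, not the authors' method):

```python
from difflib import SequenceMatcher

def word_alignment_counts(reference: str, hypothesis: str) -> dict:
    """Align reference (human) and hypothesis (ASR) word sequences and
    report the fraction of reference words that were matched, missed
    (deleted), or replaced (substituted). Illustrative sketch only; the
    hypothesis would come from an ASR model such as Whisper in practice."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    matched = missed = replaced = 0
    sm = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            matched += i2 - i1
        elif tag == "delete":
            missed += i2 - i1          # reference words ASR did not produce
        elif tag == "replace":
            # count substituted reference words; any surplus reference
            # words in the block are effectively missed
            replaced += min(i2 - i1, j2 - j1)
            missed += max(0, (i2 - i1) - (j2 - j1))
        # tag == "insert": ASR produced extra words absent from reference
    n = len(ref)
    return {"matched": matched / n, "missed": missed / n, "replaced": replaced / n}
```

For example, comparing the reference "the big dog ran away" against the hypothesis "the dog ran home" yields 3/5 matched, 1/5 missed, and 1/5 replaced.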