WhisperD: Dementia Speech Recognition and Filler Word Detection with Whisper
Journal:
arXiv
Published Date:
May 25, 2025
Abstract
Whisper fails to correctly transcribe dementia speech because persons with
dementia (PwDs) often exhibit irregular speech patterns and disfluencies such
as pauses, repetitions, and fragmented sentences. It was trained on standard
speech and may have had little or no exposure to dementia-affected speech.
However, correct transcription is vital for dementia speech for cost-effective
diagnosis and the development of assistive technology. In this work, we
fine-tune Whisper with the open-source dementia speech dataset (DementiaBank)
and our in-house dataset to improve its word error rate (WER). The fine-tuning
also includes filler words to ascertain the filler inclusion rate (FIR) and F1
score. The fine-tuned models significantly outperformed the off-the-shelf
models. The medium-sized model achieved a WER of 0.24, outperforming previous
work. Similarly, there was a notable generalisability to unseen data and speech
patterns.