Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions
Journal: arXiv
Published Date: Feb 1, 2025
Abstract
Visual speech recognition remains an open research problem. Without the
auditory signal, a system must cope with several challenges, such as visual
ambiguities, inter-speaker variability, and the complex modeling of silence.
Nonetheless, remarkable results have recently been
achieved in the field thanks to the availability of large-scale databases and
the use of powerful attention mechanisms. In addition, languages other than
English are now a focus of interest. This paper presents notable
advances in automatic continuous lipreading for Spanish. First, an end-to-end
system based on the hybrid CTC/Attention architecture is presented. Experiments
are conducted on two corpora of disparate nature, reaching state-of-the-art
results that significantly improve the best performance obtained to date for
both databases. In addition, a thorough ablation study examines how the
different components of the architecture influence the quality of speech
recognition. Then, a rigorous error analysis investigates the different
factors that could affect the learning of the automatic system. Finally, a
new Spanish lipreading benchmark is consolidated.
Code and trained models are available at
https://github.com/david-gimeno/evaluating-end2end-spanish-lipreading.
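
The hybrid CTC/Attention architecture named in the abstract is commonly trained by interpolating a CTC loss with an attention-decoder loss. The sketch below illustrates only that interpolation, under stated assumptions: the function name, the `ctc_weight` parameter, and its default of 0.1 are hypothetical and are not taken from the paper or its repository.

```python
def hybrid_ctc_attention_loss(
    ctc_loss: float,
    attention_loss: float,
    ctc_weight: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Return ctc_weight * ctc_loss + (1 - ctc_weight) * attention_loss."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    # Convex combination of the two per-batch loss values.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


if __name__ == "__main__":
    # Made-up loss values, only to show how the function is called.
    print(hybrid_ctc_attention_loss(ctc_loss=4.0, attention_loss=2.0))
```

With `ctc_weight` set to 0 the objective reduces to the attention loss alone, and with 1 it reduces to the CTC loss alone; intermediate values trade off the two.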