UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
Journal:
arXiv
Published Date:
Jun 4, 2025
Abstract
Cued Speech (CS) enhances lipreading through hand coding, providing precise
speech perception support for the hearing-impaired. CS Video-to-Speech
generation (CSV2S) task aims to convert the CS visual expressions (CS videos)
of hearing-impaired individuals into comprehensible speech signals. Direct
generation of speech from CS video (called single CSV2S) yields poor
performance due to insufficient CS data. Current research mostly focuses on CS
Recognition (CSR), which convert video content into linguistic text. Based on
this, one straightforward way of CSV2S is to combine CSR with a Text-to-Speech
system. This combined architecture relies on text as an intermediate medium for
stepwise cross-modal alignment, which may lead to error propagation and
temporal misalignment between speech and video dynamics. To address these
challenges, we propose a novel approach that directly generates speech from CS
videos without relying on intermediate text. Building upon this, we propose
UniCUE, the first unified framework for CSV2S, whose core innovation lies in
the integration of the CSR task that provides fine-grained visual-semantic
information to facilitate speech generation from CS videos. More precisely, (1)
a novel fine-grained semantic alignment pool to ensure precise mapping between
visual features and speech contents; (2) a VisioPhonetic adapter to bridge
cross-task representations, ensuring seamless compatibility between two
distinct tasks (i.e., CSV2S and CSR); (3) a pose-aware visual processor is
introduced to enhance fine-grained spatiotemporal correlations between lip and
hand movements in CS video. Experiments on our new established Chinese CS
dataset (14 cuers1: 8 hearing-impaired and 6 normal-hearing) show that our
UniCUE significantly reduces Word Error Rate by 78.3% and improves lip-speech
synchronization by 32% compared to the single CSV2S.