Audio-Driven Talking Face Video Generation with Joint Uncertainty Learning
Journal: arXiv
Published Date: Apr 26, 2025
Abstract
Talking face video generation with arbitrary speech audio is a significant
challenge in digital human technology. Previous studies
have emphasized audio-lip synchronization and visual
quality, but limited attention has been given to learning visual
uncertainty, which creates several issues in existing systems, including
inconsistent visual quality and unreliable performance across different input
conditions. To address this problem, we propose a Joint Uncertainty Learning
Network (JULNet) for high-quality talking face video generation, which
incorporates a representation of uncertainty that is directly related to visual
error. Specifically, we first design an uncertainty module to individually
predict the error map and uncertainty map after obtaining the generated image.
The error map represents the difference between the generated image and the
ground truth image, while the uncertainty map is used to predict the
probability of incorrect estimates. Furthermore, to match the uncertainty
distribution with the error distribution through a KL divergence term, we
introduce a histogram technique to approximate the distributions. Jointly
optimizing error and uncertainty can improve the performance and robustness of
our model. Extensive experiments demonstrate that our method achieves
higher fidelity and better audio-lip synchronization in talking face video
generation compared to previous methods.
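
The histogram-based matching of the error and uncertainty distributions can be illustrated with a minimal NumPy sketch. This is not the JULNet implementation: the abstract does not specify the error metric, the bin count, the direction of the KL term, or the loss weighting. The function names, the L1 error map, the 32 bins, the [0, 1] value range, and the `kl_weight` parameter below are all assumptions made for illustration.

```python
import numpy as np


def histogram_distribution(values, bins=32, value_range=(0.0, 1.0), eps=1e-8):
    """Approximate the distribution of a map with a normalized histogram.

    eps keeps every bin strictly positive so the KL term stays finite.
    """
    counts, _ = np.histogram(values.ravel(), bins=bins, range=value_range)
    probs = counts.astype(np.float64) + eps
    return probs / probs.sum()


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same bins."""
    return float(np.sum(p * np.log(p / q)))


def joint_uncertainty_loss(generated, target, uncertainty, bins=32, kl_weight=1.0):
    """Hypothetical joint objective: reconstruction error plus a KL term.

    Assumes images and the uncertainty map take values in [0, 1].
    """
    # Error map: per-pixel absolute difference (an assumed choice of metric).
    error_map = np.abs(generated - target)
    recon = float(error_map.mean())

    # Match the uncertainty distribution to the error distribution.
    p_error = histogram_distribution(error_map, bins=bins)
    p_uncertainty = histogram_distribution(uncertainty, bins=bins)
    kl = kl_divergence(p_error, p_uncertainty)

    return recon + kl_weight * kl, recon, kl


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.random((16, 16))
    noise = 0.1 * rng.standard_normal((16, 16))
    generated = np.clip(target + noise, 0.0, 1.0)
    # A constant uncertainty map does not match the spread of the error map.
    uncertainty = np.full((16, 16), 0.5)
    total, recon, kl = joint_uncertainty_loss(generated, target, uncertainty)
    print(f"total={total:.4f} recon={recon:.4f} kl={kl:.4f}")
```

When the uncertainty map equals the error map, the two histograms coincide and the KL term vanishes. A hard histogram such as `np.histogram` is not differentiable, so a trainable version would likely need a soft binning scheme in an autodiff framework; how the paper handles this is not stated in the abstract.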