EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis
Journal: arXiv
Published Date: Feb 2, 2025
Abstract
3D Gaussian splatting-based talking head synthesis has recently gained
attention for its ability to render high-fidelity images with real-time
inference speed. However, since it is typically trained on only a short video
that lacks diversity in facial emotions, the resulting talking heads
struggle to represent a wide range of emotions. To address this issue, we
propose a lip-aligned emotional face generator and leverage it to train our
EmoTalkingGaussian model. The model can manipulate facial emotions conditioned
on continuous emotion values (i.e., valence and arousal) while retaining
synchronization of lip movements with the input audio. Additionally, to achieve
accurate lip synchronization for in-the-wild audio, we introduce a
self-supervised learning method that leverages a text-to-speech network and a
visual-audio synchronization network. We evaluate EmoTalkingGaussian on
publicly available videos and obtain better results than state-of-the-art
methods in terms of image quality (measured by PSNR, SSIM, LPIPS),
emotion expression (measured by V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy),
and lip synchronization (measured by LMD, Sync-E, Sync-C).
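
The abstract names its metrics without defining them. As a rough illustration only, the sketch below computes two of them under their standard textbook definitions: an RMSE that could be applied to predicted versus target valence (V-RMSE) or arousal (A-RMSE) values, and PSNR between two images. The function names, the flat-list image representation, and the [0, 1] pixel range are assumptions made for this example and are not taken from the paper.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences.

    Assumed usage: pass per-frame valence (or arousal) values to get
    a V-RMSE (or A-RMSE) style score. Lower is better.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared_error / len(pred))


def psnr(image_a, image_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    Images are assumed to be flat lists of pixel intensities in
    [0, max_val]. Higher is better; identical images give infinity.
    """
    if len(image_a) != len(image_b) or len(image_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_val ** 2 / mse)
```

For instance, `rmse(predicted_valence, target_valence)` would yield a single scalar per video, and `psnr` would typically be averaged over all rendered frames; how the paper aggregates these scores is not stated in the abstract.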