EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation
Journal: arXiv
Published Date: Dec 2, 2024
Abstract
This paper aims to bring fine-grained expression control while maintaining
high-fidelity identity in portrait generation. This is challenging due to the
mutual interference between expression and identity: (i) fine expression
control signals inevitably introduce appearance-related semantics (e.g., facial
contours and ratio), which impact the identity of the generated portrait; (ii)
even coarse-grained expression control can cause facial changes that compromise
identity, since expression and identity both act on the face. These limitations remain unaddressed
by previous generation methods, which primarily rely on coarse control signals
or two-stage inference that integrates portrait animation. Here, we introduce
EmojiDiff, the first end-to-end solution that enables simultaneous control of
extremely detailed expression (RGB-level) and high-fidelity identity in
portrait generation. To address the above challenges, EmojiDiff adopts a
two-stage scheme involving decoupled training and fine-tuning. For decoupled
training, we introduce ID-irrelevant Data Iteration (IDI) to synthesize
cross-identity expression pairs by separating and optimizing the processes of
maintaining expression and altering identity, thereby ensuring stable and
high-quality data generation. Training the model with this data, we effectively
disentangle fine expression features in the expression template from other
extraneous information (e.g., identity, skin). Subsequently, we present
ID-enhanced Contrast Alignment (ICA) for further fine-tuning. ICA achieves
rapid reconstruction and joint supervision of identity and expression
information, thus aligning identity representations of images with and without
expression control. Experimental results demonstrate that our method substantially
outperforms existing methods, achieves precise expression control with
well-preserved identity, and generalizes well to various diffusion models.
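
The idea of aligning identity representations of images generated with and without expression control can be illustrated with a simple distance between two identity embeddings. The sketch below is only an illustration under stated assumptions: the abstract does not give the ICA objective, so the cosine-based loss, the function names `cosine_similarity` and `identity_alignment_loss`, and the use of plain Python lists as embeddings are all hypothetical stand-ins, not the authors' formulation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_alignment_loss(id_with_expr, id_without_expr):
    """Hypothetical alignment term: 1 - cosine similarity.

    id_with_expr:    identity embedding of an image generated WITH expression control
    id_without_expr: identity embedding of an image generated WITHOUT expression control
    The loss is 0 when the two embeddings point in the same direction
    and grows as they diverge. This is an assumed stand-in, not the paper's loss.
    """
    return 1.0 - cosine_similarity(id_with_expr, id_without_expr)
```

In this sketch, a smaller value means that adding expression control changed the identity embedding less. In practice the embeddings would come from a face-recognition encoder, which is outside the scope of this example.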