Proteus-ID: ID-Consistent and Motion-Coherent Video Customization
Journal:
arXiv
Published Date:
Jun 30, 2025
Abstract
Video identity customization seeks to synthesize realistic, temporally
coherent videos of a specific subject, given a single reference image and a
text prompt. This task presents two core challenges: (1) maintaining identity
consistency while aligning with the described appearance and actions, and (2)
generating natural, fluid motion without unrealistic stiffness. To address
these challenges, we introduce Proteus-ID, a novel diffusion-based framework
for identity-consistent and motion-coherent video customization. First, we
propose a Multimodal Identity Fusion (MIF) module that unifies visual and
textual cues into a joint identity representation using a Q-Former, providing
coherent guidance to the diffusion model and eliminating modality imbalance.
Second, we present a Time-Aware Identity Injection (TAII) mechanism that
dynamically modulates identity conditioning across denoising steps, improving
fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a
self-supervised strategy that reweights the training loss based on
optical-flow-derived motion heatmaps, enhancing motion realism without
requiring additional inputs. To support this task, we construct Proteus-Bench,
a high-quality dataset comprising 200K curated clips for training and 150
individuals from diverse professions and ethnicities for evaluation. Extensive
experiments demonstrate that Proteus-ID outperforms prior methods in identity
preservation, text alignment, and motion quality, establishing a new benchmark
for video identity customization. Codes and data are publicly available at
https://grenoble-zhang.github.io/Proteus-ID/.