EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation
Journal: arXiv
Published Date: Mar 24, 2025
Abstract
Conditional human animation transforms a static reference image into a
dynamic sequence by applying motion cues such as poses. These motion cues are
typically derived from video data but are susceptible to limitations including
low temporal resolution, motion blur, overexposure, and inaccuracies under
low-light conditions. In contrast, event cameras provide data streams with
exceptionally high temporal resolution, a wide dynamic range, and inherent
resistance to motion blur and exposure issues. In this work, we propose
EvAnimate, a framework that leverages event streams as motion cues to animate
static human images. Our approach employs a specialized event representation
that transforms asynchronous event streams into 3-channel slices with
controllable slicing rates and appropriate slice density, ensuring
compatibility with diffusion models. Subsequently, a dual-branch architecture
generates high-quality videos by harnessing the inherent motion dynamics of the
event streams, thereby enhancing both video quality and temporal consistency.
Specialized data augmentation strategies further enhance cross-person
generalization. Finally, we establish a new benchmark comprising simulated
event data for training and validation and a real-world event dataset
capturing human actions under normal and extreme scenarios. Experimental
results demonstrate that EvAnimate achieves high temporal fidelity and robust
performance in scenarios where traditional video-derived cues fall short.
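
To make the event representation described above more concrete, the following is a minimal sketch of how an asynchronous event stream might be binned into 3-channel slices at a controllable slicing rate. It is not the EvAnimate implementation: the abstract does not specify the channel layout, so the layout used here (positive-event count, negative-event count, normalized time of the latest event within the slice) is an assumption, as are the function name `events_to_slices`, its parameters, and the `(t, x, y, polarity)` input format.

```python
import numpy as np


def events_to_slices(events, height, width, slice_rate_hz, duration_s):
    """Bin events into 3-channel slices (hypothetical layout, not the paper's).

    events: float array of shape (N, 4) with columns
            (t in seconds, x, y, polarity in {-1, +1}).
    Returns an array of shape (n_slices, height, width, 3).
    """
    n_slices = int(np.ceil(duration_s * slice_rate_hz))
    slices = np.zeros((n_slices, height, width, 3), dtype=np.float32)
    if len(events) == 0:
        return slices

    t = events[:, 0]
    x = events[:, 1].astype(np.int64)
    y = events[:, 2].astype(np.int64)
    positive = events[:, 3] > 0

    # Slice index of each event; a higher slice_rate_hz gives thinner slices.
    idx = np.clip(np.floor(t * slice_rate_hz).astype(np.int64), 0, n_slices - 1)

    # Channels 0 and 1: per-pixel counts of positive and negative events.
    # np.add.at accumulates correctly when the same pixel appears repeatedly.
    np.add.at(slices[..., 0], (idx[positive], y[positive], x[positive]), 1.0)
    np.add.at(slices[..., 1], (idx[~positive], y[~positive], x[~positive]), 1.0)

    # Channel 2: time of the latest event at each pixel, normalized to [0, 1]
    # relative to the start of its slice.
    rel = np.clip(t * slice_rate_hz - idx, 0.0, 1.0).astype(np.float32)
    np.maximum.at(slices[..., 2], (idx, y, x), rel)
    return slices
```

In this sketch, `slice_rate_hz` plays the role of the controllable slicing rate: raising it yields more slices, each containing fewer events, so slice density has to be balanced against the frame rate expected by the downstream video model.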