Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance
Journal: arXiv
Published Date: Mar 24, 2025
Abstract
Recent advances in diffusion models have brought new vitality to visual content
creation. However, current text-to-video generation models still face
significant challenges, such as high training costs, substantial data
requirements, and difficulty keeping the motion of the foreground object
consistent with the given text. To address these challenges, we propose
mask-guided video generation, which controls video generation through mask
motion sequences while requiring only limited training data. Our model enhances
existing architectures by incorporating foreground masks for precise
text-position matching and motion trajectory control. Through mask motion
sequences, we guide the video generation process to maintain consistent
foreground objects throughout the sequence. Additionally, a first-frame
sharing strategy and an autoregressive extension approach yield more stable
and longer videos. Extensive qualitative and quantitative experiments
demonstrate that this approach performs well on a range of video generation
tasks, such as video editing and artistic video generation, outperforming
previous methods in terms of consistency and quality. Generated results are
available in the supplementary materials.
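
To make the two control ideas in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a mask motion sequence can be approximated by translating a first-frame binary mask at constant velocity, and that autoregressive extension can be pictured as overlapping frame windows in which each window starts on the last frame of the previous one. The function names (`shift_mask`, `mask_motion_sequence`, `autoregressive_windows`) and the constant-velocity assumption are hypothetical; the diffusion model itself is not shown.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask, 1 = foreground pixel


def shift_mask(mask: Mask, dx: int, dy: int) -> Mask:
    """Translate a binary mask by (dx, dy); pixels moved off-frame are dropped."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    out[ny][nx] = 1
    return out


def mask_motion_sequence(first_mask: Mask, step: Tuple[int, int],
                         num_frames: int) -> List[Mask]:
    """Hypothetical helper: one mask per frame, moving at constant velocity."""
    dx, dy = step
    return [shift_mask(first_mask, dx * t, dy * t) for t in range(num_frames)]


def autoregressive_windows(total_frames: int, window: int) -> List[List[int]]:
    """Hypothetical helper: split a long clip into windows of frame indices.

    Each window after the first begins on the last frame of the previous
    window, so that frame can serve as the conditioning frame for the next
    chunk (an assumed reading of "autoregressive extension").
    """
    if window < 2:
        raise ValueError("window must contain at least 2 frames")
    windows: List[List[int]] = []
    start = 0
    while start < total_frames - 1:
        end = min(start + window, total_frames)
        windows.append(list(range(start, end)))
        start = end - 1  # reuse the last frame as the next window's first frame
    return windows


if __name__ == "__main__":
    first = [[1, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
    masks = mask_motion_sequence(first, step=(1, 1), num_frames=3)
    for frame_index, frame_mask in enumerate(masks):
        print(frame_index, frame_mask)
    print(autoregressive_windows(total_frames=7, window=4))
```

In this sketch, each per-frame mask would be supplied to the generator as a spatial hint for where the foreground object should appear, and the shared boundary frame between windows is what ties consecutive chunks together.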