Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation
Journal: arXiv
Published Date: Dec 24, 2024
Abstract
An image may convey a thousand words, but a video composed of hundreds or
thousands of image frames tells a more intricate story. Despite significant
progress in multimodal large language models (MLLMs), generating extended
videos remains a formidable challenge. As of this writing, OpenAI's Sora, the
current state-of-the-art system, is still limited to producing videos that are
up to one minute in length. This limitation stems from the complexity of long
video generation, which requires more than generative AI techniques for
approximating density functions; essential aspects such as planning, story
development, and maintaining spatial and temporal consistency present
additional hurdles. Integrating generative AI with a divide-and-conquer
approach could improve scalability for longer videos while offering greater
control. In this survey, we examine the current landscape of long video
generation, covering foundational techniques like GANs and diffusion models,
video generation strategies, large-scale training datasets, quality metrics for
evaluating long videos, and future research areas to address the limitations of
the existing video generation capabilities. We believe this survey can serve
as a comprehensive foundation, offering extensive information to guide future
advancements and research in the field of long video generation.