EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos
Journal: arXiv
Published Date: Apr 16, 2025
Abstract
Generating videos from a first-person perspective has broad applications in
augmented reality and embodied intelligence. In this
work, we explore the cross-view video prediction task, where given an
exo-centric video, the first frame of the corresponding ego-centric video, and
textual instructions, the goal is to generate future frames of the ego-centric
video. Inspired by the notion that hand-object interactions (HOI) in
ego-centric videos represent the primary intentions and actions of the current
actor, we present EgoExo-Gen that explicitly models the hand-object dynamics
for cross-view video prediction. EgoExo-Gen consists of two stages. First, we
design a cross-view HOI mask prediction model that anticipates the HOI masks in
future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next,
we employ a video diffusion model to predict future ego-frames using the first
ego-frame and textual instructions, while incorporating the HOI masks as
structural guidance to enhance prediction quality. To facilitate training, we
develop an automated pipeline to generate pseudo HOI masks for both ego- and
exo-videos by exploiting vision foundation models. Extensive experiments
demonstrate that our proposed EgoExo-Gen achieves better prediction performance
than previous video prediction models on the Ego-Exo4D and H2O benchmark
datasets, with the HOI masks significantly improving the generation of hands
and interactive objects in the ego-centric videos.
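
The two-stage data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: every function name, the type aliases, and the stub bodies below are hypothetical placeholders chosen for illustration. The stage-one stub returns empty masks and the stage-two stub repeats the first frame, standing in for the cross-view HOI mask prediction model and the video diffusion model respectively. Only the inputs and outputs of each stage follow the abstract.

```python
from typing import List, Tuple

# Hypothetical stand-ins for real image tensors.
Frame = List[List[float]]  # H x W grayscale frame
Mask = List[List[int]]     # H x W binary hand-object interaction (HOI) mask


def predict_hoi_masks(
    exo_video: List[Frame], first_ego_frame: Frame, num_future: int
) -> List[Mask]:
    """Stage 1 placeholder: one HOI mask per future ego frame.

    A real model would use the exo-centric video and the first ego frame
    to anticipate the masks; this stub returns all-zero masks.
    """
    height, width = len(first_ego_frame), len(first_ego_frame[0])
    return [[[0] * width for _ in range(height)] for _ in range(num_future)]


def generate_ego_frames(
    first_ego_frame: Frame, instruction: str, hoi_masks: List[Mask]
) -> List[Frame]:
    """Stage 2 placeholder: one future ego frame per HOI mask.

    A real model would be a video diffusion model conditioned on the first
    ego frame, the text, and the masks; this stub copies the first frame.
    """
    return [[row[:] for row in first_ego_frame] for _ in hoi_masks]


def egoexo_gen_sketch(
    exo_video: List[Frame],
    first_ego_frame: Frame,
    instruction: str,
    num_future: int,
) -> Tuple[List[Mask], List[Frame]]:
    """Chain the two stages: masks first, then mask-guided frame generation."""
    masks = predict_hoi_masks(exo_video, first_ego_frame, num_future)
    frames = generate_ego_frames(first_ego_frame, instruction, masks)
    return masks, frames
```

Note that in this sketch the exo-centric video enters only the first stage, and the second stage sees it indirectly through the predicted masks, which matches the inputs the abstract lists for each stage.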