Learning to Generate Long-term Future Narrations Describing Activities of Daily Living
Journal:
arXiv
Published Date:
Mar 3, 2025
Abstract
Anticipating future events is crucial for various application domains such as
healthcare, smart home technology, and surveillance. Narrative event
descriptions provide context-rich information, enhancing a system's future
planning and decision-making capabilities. We propose a novel task:
$\textit{long-term future narration generation}$, which extends beyond
traditional action anticipation by generating detailed narrations of future
daily activities. We introduce a visual-language model, ViNa, specifically
designed to address this challenging task. ViNa integrates long-term videos and
corresponding narrations to generate a sequence of future narrations that
predict subsequent events and actions over extended time horizons. Unlike
existing multimodal models, which only make short-term predictions or describe
observed videos, ViNa generates long-term future narrations for a broader range
of daily activities. We also present a novel downstream application, future
video retrieval, which uses the generated narrations to help users plan a task
by visualizing the future. We evaluate future narration generation on Ego4D,
the largest egocentric dataset.