Causally Steered Diffusion for Automated Video Counterfactual Generation
Journal: arXiv
Published Date: Jun 17, 2025
Abstract
Adapting text-to-image (T2I) latent diffusion models (LDMs) for video editing has
shown strong visual fidelity and controllability, but challenges remain in
maintaining causal relationships in video content. Edits affecting causally
dependent attributes risk generating unrealistic or misleading outcomes if
these relationships are ignored. In this work, we propose a causally faithful
framework for counterfactual video generation, guided by a vision-language
model (VLM). Our method is agnostic to the underlying video editing system and
does not require access to its internal mechanisms or finetuning. Instead, we
guide the generation by optimizing text prompts based on an assumed causal
graph, addressing the challenge of latent space control in LDMs. We evaluate
our approach using standard video quality metrics and counterfactual-specific
criteria, such as causal effectiveness and minimality. Our results demonstrate
that causally faithful video counterfactuals can be effectively generated
within the learned distribution of LDMs through prompt-based causal steering.
Because it is compatible with any black-box video editing system, our method
holds significant potential for generating realistic "what-if" video scenarios
in diverse areas such as healthcare and digital media.
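
The idea of steering a black-box editor through its text prompt alone can be illustrated with a small sketch. This is not the optimization procedure of the paper: it is a plain greedy search, written under several assumptions. The causal graph, the attribute names, and the helpers `descendants`, `attributes_to_preserve`, `optimize_prompt`, `toy_editor`, `toy_score`, and `toy_propose` are all hypothetical. In a real setting `toy_editor` would be replaced by a call to the video editing system, and `toy_score` by a VLM-based judgement of causal effectiveness and minimality.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical causal graph: each attribute maps to the attributes it directly causes.
CAUSAL_GRAPH: Dict[str, List[str]] = {
    "age": ["beard", "baldness"],
    "gender": ["beard", "baldness"],
    "beard": [],
    "baldness": [],
}


def descendants(graph: Dict[str, List[str]], node: str) -> Set[str]:
    """Return every attribute causally downstream of `node`."""
    seen: Set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def attributes_to_preserve(graph: Dict[str, List[str]], intervened: str) -> Set[str]:
    """Attributes that are neither intervened on nor downstream of the intervention."""
    return set(graph) - {intervened} - descendants(graph, intervened)


def optimize_prompt(
    initial_prompt: str,
    edit_video: Callable[[str], object],      # black-box editor: prompt -> edited video
    score: Callable[[object], float],         # stand-in for a VLM-based evaluation
    propose: Callable[[str], List[str]],      # generates revised prompts
    steps: int = 3,
) -> Tuple[str, float]:
    """Greedy search: keep a revised prompt only if its edited output scores higher."""
    best_prompt = initial_prompt
    best_score = score(edit_video(best_prompt))
    for _ in range(steps):
        for candidate in propose(best_prompt):
            candidate_score = score(edit_video(candidate))
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


# Toy stand-ins (assumptions for illustration only).
FACTUAL = {"age": "young", "gender": "woman", "beard": "no beard", "baldness": "not bald"}
KEEP = attributes_to_preserve(CAUSAL_GRAPH, "beard")  # intervening on "beard"


def toy_editor(prompt: str) -> str:
    return prompt  # the "video" is just the prompt text here


def toy_score(video: object) -> float:
    # Counts how many attributes that should stay fixed are still described.
    return float(sum(FACTUAL[a] in str(video) for a in KEEP))


def toy_propose(prompt: str) -> List[str]:
    return [f"{prompt}, {FACTUAL[a]}" for a in sorted(KEEP) if FACTUAL[a] not in prompt]


best_prompt, best_score = optimize_prompt(
    "a person with a beard", toy_editor, toy_score, toy_propose
)
```

In this toy setup an intervention on `beard` leaves `age`, `gender`, and `baldness` in the preserve set, because none of them is downstream of `beard` in the assumed graph. An intervention on `age` would instead leave only `gender` to preserve, since `beard` and `baldness` are allowed to change as causal effects.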