Open-world surgical video generation via dual-visual diffusion and dual-annealed generation.

Journal: Neural networks : the official journal of the International Neural Network Society

Abstract

Surgical video generation has drawn increasing attention because of its potential in surgical tasks. Existing methods mainly fit the motion patterns seen in the training data; they cannot reasonably extrapolate those patterns and therefore cannot generate examples from open-world prompts. To address this problem, this paper designs a novel diffusion model with two components. (1) Building on existing diffusion models guided by images and text, we adopt a multi-channel encoding of the guiding image, comprising an image encoding module based on a pre-trained VAE (Variational Auto-Encoder) and an auxiliary image encoding module pre-trained with CLIP (Contrastive Language-Image Pretraining). This further enhances the representational consistency between text and image. (2) To encourage the model to explore the generation space, we propose dual-annealed noise generation, applied to both the prompt text and the prompt image. In the early inference steps, more noise is added to the representations of the prompt text and the prompt image so that the model fully explores the sampling space. As inference progresses, the noise is gradually reduced, so the guiding signal strengthens and video details are recovered. Experiments on the CholeTriplet and AutoLaparo datasets demonstrate that the model generates high-quality videos under both closed-domain and open-domain conditions and effectively boosts performance on downstream tasks such as semi-supervised video classification and surgical triplet/phase recognition.
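
The dual-annealed noise idea in (2) can be illustrated with a minimal NumPy sketch. The abstract does not give the schedule or the noise distribution, so the linear decay, the Gaussian noise, the embedding shapes, and the names `annealed_sigma` and `perturb_conditions` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def annealed_sigma(step, num_steps, sigma_max=1.0, sigma_min=0.0):
    """Hypothetical linear schedule: sigma_max at step 0, sigma_min at the last step."""
    if num_steps < 2:
        return sigma_min
    frac = step / (num_steps - 1)
    return sigma_max + (sigma_min - sigma_max) * frac


def perturb_conditions(text_emb, image_emb, step, num_steps, rng,
                       sigma_text=1.0, sigma_image=1.0):
    """Add annealed Gaussian noise to both conditioning embeddings (assumed form)."""
    s_text = annealed_sigma(step, num_steps, sigma_text)
    s_image = annealed_sigma(step, num_steps, sigma_image)
    noisy_text = text_emb + s_text * rng.standard_normal(text_emb.shape)
    noisy_image = image_emb + s_image * rng.standard_normal(image_emb.shape)
    return noisy_text, noisy_image


# Toy embeddings standing in for the prompt-text and prompt-image representations.
rng = np.random.default_rng(0)
text_emb = np.ones((4, 8))
image_emb = np.ones((16, 8))
num_steps = 5

sigmas = [annealed_sigma(s, num_steps) for s in range(num_steps)]
text_deviation = []
for step in range(num_steps):
    noisy_text, noisy_image = perturb_conditions(
        text_emb, image_emb, step, num_steps, rng)
    # In a real sampler, the denoiser would be conditioned on these noisy embeddings.
    text_deviation.append(float(np.abs(noisy_text - text_emb).mean()))
```

In this sketch the conditioning is heavily perturbed at the first step and exactly clean at the last, which mirrors the explore-then-refine behaviour the abstract describes. Separate `sigma_text` and `sigma_image` arguments allow the two prompts to be annealed at different strengths.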
