Hear-Your-Click: Interactive Video-to-Audio Generation via Object-aware Contrastive Audio-Visual Fine-tuning
Journal: arXiv
Published Date: Jul 7, 2025
Abstract
Video-to-audio (V2A) generation shows great potential in fields such as film
production. Despite significant advances, current V2A methods, which rely on
global video information, struggle with complex scenes and often fail to
generate audio tailored to specific objects or regions in the videos. To
address these limitations, we introduce Hear-Your-Click, an interactive V2A
framework that enables users to generate sounds for specific objects in the
videos by simply clicking on the frame. To achieve this, we propose
Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided
Visual Encoder (MVE) to obtain object-level visual features aligned with
corresponding audio segments. Furthermore, we design two data augmentation
strategies, Random Video Stitching (RVS) and Mask-guided Loudness Modulation
(MLM), to enhance the model's sensitivity to the segmented objects. To
measure audio-visual correspondence, we introduce a new evaluation metric,
the CAV score. Extensive experiments demonstrate that
our framework offers more precise control and improved generation performance
across various metrics. Project Page:
https://github.com/SynapGrid/Hear-Your-Click
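
As a rough illustration of the contrastive audio-visual alignment idea behind OCAV, the sketch below computes a generic symmetric InfoNCE loss over paired object-level visual and audio embeddings. The abstract does not specify the exact objective, so the loss form, the function names (`l2_normalize`, `cross_entropy_diag`, `symmetric_info_nce`), and the temperature value are all assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(visual, audio, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss over a batch of paired embeddings.

    visual[i] and audio[i] are assumed to come from the same object/clip pair;
    every other pairing in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    a = [l2_normalize(x) for x in audio]
    # Cosine similarity matrix scaled by the (assumed) temperature.
    logits = [
        [sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-visual direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two object-level visual features and two audio features.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[1.0, 0.0], [0.0, 1.0]]
swapped_audio = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_info_nce(visual_feats, aligned_audio)
swapped_loss = symmetric_info_nce(visual_feats, swapped_audio)
```

In this toy batch the correctly paired embeddings yield a loss near zero, while swapping the audio features produces a much larger loss, which is the behavior a contrastive fine-tuning objective relies on to pull matched object and sound features together.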