MSNeRV: Neural Video Representation with Multi-Scale Feature Fusion
Journal: arXiv
Published Date: Jun 18, 2025
Abstract
Implicit neural representations (INRs) have emerged as a promising approach
for video compression, achieving performance comparable to state-of-the-art
codecs such as H.266/VVC. However, existing INR-based methods
struggle to effectively represent detail-intensive and fast-changing video
content. This limitation mainly stems from the underutilization of internal
network features and the absence of video-specific considerations in network
design. To address these challenges, we propose a multi-scale feature fusion
framework, MSNeRV, for neural video representation. In the encoding stage, we
enhance temporal consistency by employing temporal windows, and divide the
video into multiple Groups of Pictures (GoPs), where a GoP-level grid is used
for background representation. Additionally, we design a multi-scale spatial
decoder with a scale-adaptive loss function to integrate multi-resolution and
multi-frequency information. To further improve feature extraction, we
introduce a multi-scale feature block that fully leverages hidden features. We
evaluate MSNeRV on the HEVC Class B and UVG datasets for video representation and
compression. Experimental results demonstrate that our model exhibits superior
representation capability among INR-based approaches and surpasses VTM-23.7
(Random Access) in dynamic scenarios in terms of compression efficiency.
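
To make the encoding-stage partitioning concrete, the following is a minimal sketch of how a video's frame indices might be divided into GoPs and how a temporal window around a frame might be formed. It is an illustration only, not the MSNeRV implementation: the function names, the `gop_size` and `radius` parameters, and the choice to clamp window indices at the video boundaries are all assumptions that do not come from the abstract.

```python
from typing import List


def split_into_gops(num_frames: int, gop_size: int) -> List[range]:
    """Partition frame indices 0..num_frames-1 into consecutive GoPs.

    Hypothetical helper: the last GoP may be shorter than gop_size.
    """
    if num_frames < 0 or gop_size <= 0:
        raise ValueError("num_frames must be >= 0 and gop_size must be > 0")
    return [
        range(start, min(start + gop_size, num_frames))
        for start in range(0, num_frames, gop_size)
    ]


def temporal_window(t: int, num_frames: int, radius: int) -> List[int]:
    """Return the 2 * radius + 1 frame indices centred on frame t.

    Assumption: indices falling outside the video are clamped to the
    first or last frame, so boundary frames are repeated.
    """
    if not 0 <= t < num_frames:
        raise ValueError("t must be a valid frame index")
    if radius < 0:
        raise ValueError("radius must be >= 0")
    return [min(max(i, 0), num_frames - 1) for i in range(t - radius, t + radius + 1)]


def gop_index(t: int, gop_size: int) -> int:
    """Index of the GoP containing frame t (selects the GoP-level grid)."""
    return t // gop_size
```

Under these assumptions, a 10-frame clip with `gop_size=4` yields three GoPs (frames 0-3, 4-7, and 8-9), and every frame in a GoP would share that GoP's background grid while using its own temporal window.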