MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization
Journal: arXiv
Published Date: Jul 6, 2025
Abstract
Camera relocalization, a cornerstone capability of modern computer vision,
accurately determines a camera's position and orientation (6-DoF) from images
and is essential for applications in augmented reality (AR), mixed reality
(MR), autonomous driving, delivery drones, and robotic navigation. Traditional
deep learning-based methods regress camera pose from images in a single scene
and often lack generalization and robustness in diverse environments. To
address this, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera
relocalization framework. MVL-Loc leverages pretrained world knowledge from
vision-language models (VLMs) and incorporates multimodal data to generalize
across both indoor and outdoor settings. Furthermore, natural language is
employed as a directive tool to guide the multi-scene learning process,
facilitating semantic understanding of complex scenes and capturing spatial
relationships among objects. Extensive experiments on the 7Scenes and Cambridge
Landmarks datasets demonstrate MVL-Loc's robustness and state-of-the-art
performance in real-world multi-scene camera relocalization, with improved
accuracy in both positional and orientational estimates.
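
To make "positional and orientational estimates" concrete, the following is a minimal sketch of how 6-DoF pose error is commonly measured: Euclidean distance for position and the angle between unit quaternions for orientation. It is not taken from the paper. The function names, the (w, x, y, z) quaternion layout, and the choice of meters and degrees are assumptions made for illustration.

```python
import math


def position_error(t_pred, t_true):
    """Euclidean distance between predicted and ground-truth camera positions.

    Assumption: positions are 3-element sequences in the same unit (e.g. meters).
    """
    return math.dist(t_pred, t_true)


def orientation_error_deg(q_pred, q_true):
    """Angle in degrees between two orientations given as quaternions.

    Assumption: quaternions are 4-element sequences in (w, x, y, z) order.
    Both are normalized here, so non-unit inputs are tolerated.
    """
    norm_pred = math.sqrt(sum(c * c for c in q_pred))
    norm_true = math.sqrt(sum(c * c for c in q_true))
    dot = sum(a * b for a, b in zip(q_pred, q_true)) / (norm_pred * norm_true)
    # q and -q encode the same rotation, so use |dot|; clamp against rounding.
    dot = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(dot))


if __name__ == "__main__":
    # Hypothetical values: a 90-degree rotation about the z-axis.
    identity = (1.0, 0.0, 0.0, 0.0)
    quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
    print(position_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
    print(orientation_error_deg(identity, quarter_turn_z))
```

Relocalization benchmarks typically report these two quantities per scene, often as medians over the test frames; whether MVL-Loc uses exactly this formulation is not stated in the abstract.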