VOCAL: Visual Odometry via ContrAstive Learning
Journal: arXiv
Published Date: Jun 30, 2025
Abstract
Advances in visual odometry (VO) have reshaped robotics by enabling the
precise camera state estimation that modern autonomous systems depend on.
Despite these advances, many learning-based VO techniques rely on rigid
geometric assumptions; these techniques often fall short in interpretability and
lack a solid theoretical basis within fully data-driven frameworks. To overcome
these limitations, we introduce VOCAL
(Visual Odometry via ContrAstive Learning), a novel framework that recasts
VO as a label ranking problem. By integrating Bayesian inference with a
representation learning framework, VOCAL organizes visual features to mirror
camera states. The ranking mechanism compels similar camera states to converge
into consistent and spatially coherent representations within the latent space.
This alignment both improves the interpretability of the learned
features and keeps them compatible with multimodal data sources. Extensive
evaluations on the KITTI dataset highlight VOCAL's enhanced interpretability
and flexibility, pushing VO toward more general and explainable spatial
intelligence.
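
To make the ranking idea concrete, the following is a minimal sketch of a generic rank-based contrastive loss, written in plain Python. It is an illustration under stated assumptions, not VOCAL's actual objective: the abstract does not give the loss, so the function name `ranking_contrastive_loss`, the use of negative Euclidean distance as the similarity, the `temperature` parameter, and the one-dimensional "camera state" labels are all hypothetical choices. For each anchor i and partner j, every sample whose label is at least as far from i as j's label is treated as a negative, so embeddings are pushed to preserve the ordering of label distances.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ranking_contrastive_loss(features, labels, temperature=1.0):
    """Hypothetical rank-based contrastive loss (not the paper's formulation).

    features: list of N embedding vectors.
    labels:   list of N continuous label vectors (e.g. camera state components).
    Returns the mean over all ordered pairs (i, j), i != j, of
        -log( exp(s_ij) / sum_{k != i, d_ik >= d_ij} exp(s_ik) ),
    where s is negative feature distance / temperature and d is label distance.
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two samples to form a pair")
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_ij = euclidean(labels[i], labels[j])
            sim_ij = -euclidean(features[i], features[j]) / temperature
            # Negatives: samples at least as far from anchor i in label space as j.
            # The set always contains j itself, so it is never empty.
            logits = [
                -euclidean(features[i], features[k]) / temperature
                for k in range(n)
                if k != i and euclidean(labels[i], labels[k]) >= d_ij
            ]
            # Log-sum-exp with the maximum subtracted for numerical stability.
            peak = max(logits)
            log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
            total += log_denominator - sim_ij
            count += 1
    return total / count


# Toy check: embeddings ordered like the labels score lower than shuffled ones.
labels = [[0.0], [1.0], [2.0], [3.0]]          # assumed 1-D camera state
aligned = ranking_contrastive_loss([[0.0], [1.0], [2.0], [3.0]], labels)
shuffled = ranking_contrastive_loss([[0.0], [3.0], [1.0], [2.0]], labels)
print(aligned < shuffled)  # prints True
```

In this toy setup the loss is lower when the embedding order matches the label order, which is the behavior the abstract describes as similar camera states converging to spatially coherent representations. A practical version would operate on batched tensors in an autodiff framework rather than Python lists.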