Multimodal Spatial Language Maps for Robot Navigation and Manipulation
Journal:
arXiv
Published Date:
Jun 7, 2025
Abstract
Grounding language to a navigating agent's observations can leverage
pretrained multimodal foundation models to match perceptions to object or event
descriptions. However, previous approaches remain disconnected from environment
mapping, lack the spatial precision of geometric maps, or neglect additional
modality information beyond vision. To address this, we propose multimodal
spatial language maps as a spatial map representation that fuses pretrained
multimodal features with a 3D reconstruction of the environment. We build these
maps autonomously using standard exploration. We present two instances of our
maps, which are visual-language maps (VLMaps) and their extension to
audio-visual-language maps (AVLMaps) obtained by adding audio information. When
combined with large language models (LLMs), VLMaps can (i) translate natural
language commands into open-vocabulary spatial goals (e.g., "in between the
sofa and TV") directly localized in the map, and (ii) be shared across
different robot embodiments to generate tailored obstacle maps on demand.
Building upon the capabilities above, AVLMaps extend VLMaps by introducing a
unified 3D spatial representation integrating audio, visual, and language cues
through the fusion of features from pretrained multimodal foundation models.
This enables robots to ground multimodal goal queries (e.g., text, images, or
audio snippets) to spatial locations for navigation. Additionally, the
incorporation of diverse sensory inputs significantly enhances goal
disambiguation in ambiguous environments. Experiments in simulation and
real-world settings demonstrate that our multimodal spatial language maps
enable zero-shot spatial and multimodal goal navigation and improve recall by
50% in ambiguous scenarios. These capabilities extend to mobile robots and
tabletop manipulators, supporting navigation and interaction guided by visual,
audio, and spatial cues.