Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation
Journal: arXiv
Published Date: Feb 20, 2025
Abstract
Recent advancements in Large Language Models (LLMs) and Vision-Language
Models (VLMs) have made them powerful tools in embodied navigation, enabling
agents to leverage commonsense and spatial reasoning for efficient exploration
in unfamiliar environments. Existing LLM-based approaches convert global
memory, such as semantic or topological maps, into language descriptions to
guide navigation. While this improves efficiency and reduces redundant
exploration, the loss of geometric information in language-based
representations hinders spatial reasoning, especially in intricate
environments. To address this, VLM-based approaches directly process
ego-centric visual inputs to select optimal directions for exploration.
However, relying solely on a first-person perspective makes navigation a
partially observable decision-making problem, leading to suboptimal decisions
in complex environments. In this paper, we present a novel VLM-based
navigation framework that addresses these challenges by adaptively
retrieving task-relevant cues from a global memory module and integrating them
with the agent's egocentric observations. By dynamically aligning global
contextual information with local perception, our approach enhances spatial
reasoning and decision-making in long-horizon tasks. Experimental results
demonstrate that the proposed method surpasses previous state-of-the-art
approaches in object navigation tasks, providing a more effective and scalable
solution for embodied navigation.
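
To make the global-to-ego idea concrete, the following is a minimal sketch of how task-relevant entries in a global memory might be re-expressed in the agent's egocentric frame before being passed to a VLM. It is not the implementation described in the paper: the `Landmark` type, the functions `global_to_ego` and `retrieve_cues`, the label-based relevance filter, and the `max_range` cutoff are all hypothetical names and simplifications introduced for illustration, and the memory is assumed to be a flat list of 2D landmark positions.

```python
import math
from dataclasses import dataclass


@dataclass
class Landmark:
    """Hypothetical global-memory entry: a labelled 2D position in metres."""
    label: str
    x: float
    y: float


def global_to_ego(lm, agent_x, agent_y, agent_yaw):
    """Express a global-frame landmark in the agent frame (x forward, y left).

    agent_yaw is the agent heading in radians, measured counter-clockwise
    from the global x axis.
    """
    dx, dy = lm.x - agent_x, lm.y - agent_y
    cos_t, sin_t = math.cos(agent_yaw), math.sin(agent_yaw)
    # Rotate the offset by -yaw to move from the global to the ego frame.
    ex = cos_t * dx + sin_t * dy
    ey = -sin_t * dx + cos_t * dy
    return ex, ey


def retrieve_cues(memory, relevant_labels, agent_x, agent_y, agent_yaw,
                  max_range=10.0):
    """Return (label, distance, bearing_deg) for relevant nearby landmarks.

    Relevance is assumed to be a simple label match here; bearing is in
    degrees, positive to the agent's left. Results are sorted by distance.
    """
    cues = []
    for lm in memory:
        if lm.label not in relevant_labels:
            continue
        ex, ey = global_to_ego(lm, agent_x, agent_y, agent_yaw)
        dist = math.hypot(ex, ey)
        if dist <= max_range:
            bearing = math.degrees(math.atan2(ey, ex))
            cues.append((lm.label, dist, bearing))
    return sorted(cues, key=lambda c: c[1])


if __name__ == "__main__":
    memory = [
        Landmark("door", 0.0, 2.0),
        Landmark("sofa", 3.0, 0.0),
        Landmark("plant", 1.0, 1.0),  # filtered out: not task-relevant
    ]
    # Agent at the origin, facing along the global +y axis.
    for label, dist, bearing in retrieve_cues(
            memory, {"door", "sofa"}, 0.0, 0.0, math.pi / 2):
        print(f"{label}: {dist:.2f} m at {bearing:+.1f} deg")
```

In this sketch the retrieved cues are plain distance and bearing pairs; a system in the spirit of the abstract would presumably align such cues with the agent's current image rather than return them as numbers, and that step is left out here.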