RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving
Journal: arXiv
Published Date: Mar 18, 2025
Abstract
Accurately understanding and deciding on high-level meta-actions is essential
for ensuring reliable and safe autonomous driving systems. While
vision-language models (VLMs) have shown significant potential in various
autonomous driving tasks, they often suffer from limitations such as inadequate
spatial perception and hallucination, reducing their effectiveness in complex
autonomous driving scenarios. To address these challenges, we propose a
retrieval-augmented decision-making (RAD) framework, a novel architecture
designed to enhance VLMs' capabilities to reliably generate meta-actions in
autonomous driving scenes. RAD leverages a retrieval-augmented generation (RAG)
pipeline to dynamically improve decision accuracy through a three-stage process
consisting of the embedding flow, retrieving flow, and generating flow.
Additionally, we fine-tune VLMs on a specifically curated dataset derived from
the NuScenes dataset to enhance their spatial perception and bird's-eye view
image comprehension capabilities. Extensive experimental evaluations on the
curated NuScenes-based dataset demonstrate that RAD outperforms baseline
methods across key evaluation metrics, including match accuracy, F1 score,
and a self-defined overall score, highlighting its effectiveness in improving
meta-action decision-making for autonomous driving tasks.
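
The three-stage process named in the abstract (embedding flow, retrieving flow, generating flow) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that scene embeddings are already available as plain vectors, that retrieval ranks stored scenes by cosine similarity, and that the retrieved meta-actions are placed into a text prompt for a VLM. The names SceneRecord, retrieve, and build_prompt are hypothetical, and the VLM call itself is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneRecord:
    """Hypothetical database entry: a scene embedding plus its meta-action."""
    embedding: list[float]  # assumed to come from some image encoder (embedding flow)
    meta_action: str        # meta-action label stored alongside the scene


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: list[float], database: list[SceneRecord], k: int = 2) -> list[SceneRecord]:
    """Retrieving flow (assumed form): return the k most similar stored scenes."""
    ranked = sorted(
        database,
        key=lambda record: cosine_similarity(query, record.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(retrieved: list[SceneRecord]) -> str:
    """Generating flow (assumed form): build the text context passed to a VLM."""
    examples = "\n".join(f"- similar scene -> {r.meta_action}" for r in retrieved)
    return (
        "Reference meta-actions from similar scenes:\n"
        f"{examples}\n"
        "Decide the meta-action for the current scene."
    )


# Toy data: two-dimensional embeddings chosen only for illustration.
database = [
    SceneRecord([1.0, 0.0], "go straight"),
    SceneRecord([0.0, 1.0], "turn left"),
    SceneRecord([0.9, 0.1], "slow down"),
]
top_scenes = retrieve([1.0, 0.05], database, k=2)
prompt = build_prompt(top_scenes)
```

In a real system the prompt would be sent to the fine-tuned VLM together with the current scene images. The similarity measure, the value of k, and the prompt wording are all design choices that this sketch fixes arbitrarily.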