SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models
Journal: arXiv
Published Date: Feb 28, 2025
Abstract
Traditional autonomous driving systems often struggle to integrate high-level
reasoning with low-level control, resulting in suboptimal and sometimes unsafe
driving behaviors. The emergence of Multimodal Large Language Models (MLLMs),
which can process both visual and textual data, presents an opportunity to
unify perception and reasoning tasks within a single framework. However,
effectively embedding precise safety knowledge into MLLMs for autonomous
driving remains a significant challenge. To address this, we propose SafeAuto,
a novel framework that enhances MLLM-based autonomous driving systems by
incorporating both unstructured and structured knowledge. Specifically, we
first introduce the Position-Dependent Cross-Entropy (PDCE) loss function,
designed to improve the accuracy of low-level control signal predictions when
numerical values are represented as text. Second, to ensure safe autonomous
driving by explicitly integrating precise safety knowledge into the MLLM, we
develop a reasoning component for SafeAuto. This component translates driving
safety regulations into first-order logic rules (e.g., "red light => stop") and
incorporates these rules into a probabilistic graphical model, such as a Markov
Logic Network (MLN). The MLN is trained to verify the predicted next actions
using environmental attributes identified by attribute recognition models
(e.g., detecting a red light) to form the predicates. Additionally, we
construct a Multimodal Retrieval-Augmented Generation (RAG) model that
leverages video data, control signals, and environmental attributes to learn
more effectively from similar past driving experiences. By integrating PDCE,
MLN, and Multimodal RAG, SafeAuto
significantly outperforms existing baselines across multiple datasets. This
advancement enables more accurate, reliable, and safer autonomous driving
systems that learn from experience, obey traffic laws, and perform precise
control actions.
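
The verification step described above, in which weighted first-order rules are checked against recognized environmental attributes, can be illustrated with a toy Markov Logic Network. The sketch below is not the SafeAuto implementation: the predicate names, the three rules, their weights, and the helper functions are all hypothetical and chosen only to show the mechanics. It relies on the standard MLN definition, where the probability of a world is proportional to the exponential of the summed weights of the rules that world satisfies.

```python
import math

# Candidate next actions; exactly one is true in each possible world.
# These names are illustrative assumptions, not taken from the paper.
ACTIONS = ("Stop", "Accelerate")


def implies(antecedent, consequent):
    """Truth value of the logical implication antecedent => consequent."""
    return (not antecedent) or consequent


# Hypothetical weighted rules as (weight, formula) pairs. A higher weight
# means a world that violates the rule becomes less probable.
RULES = [
    (3.0, lambda w: implies(w["RedLight"], w["Stop"])),
    (2.0, lambda w: implies(w["ObstacleAhead"], not w["Accelerate"])),
    (1.0, lambda w: implies(not w["RedLight"] and not w["ObstacleAhead"],
                            w["Accelerate"])),
]


def world_for(evidence, action):
    """Combine observed attributes with one candidate action into a world."""
    world = dict(evidence)
    for name in ACTIONS:
        world[name] = (name == action)
    return world


def action_posterior(evidence):
    """P(action | evidence), proportional to exp(sum of satisfied weights)."""
    scores = {
        action: sum(weight for weight, formula in RULES
                    if formula(world_for(evidence, action)))
        for action in ACTIONS
    }
    normalizer = sum(math.exp(score) for score in scores.values())
    return {action: math.exp(score) / normalizer
            for action, score in scores.items()}


def verify(evidence, predicted_action):
    """Return the MLN's most probable action and whether it matches."""
    posterior = action_posterior(evidence)
    best_action = max(posterior, key=posterior.get)
    return best_action, best_action == predicted_action


# Usage: attributes as an attribute recognition model might report them.
red_light = {"RedLight": True, "ObstacleAhead": False}
print(action_posterior(red_light))
print(verify(red_light, "Accelerate"))
```

In this toy setting the rule weights are fixed by hand, whereas the abstract states that the MLN is trained. A predicted action that disagrees with the most probable action under the rules is flagged, which is one simple way such a check could be used to override or correct an unsafe prediction.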