DriveMM: All-in-One Large Multimodal Model for Autonomous Driving
Journal: arXiv
Published Date: Dec 10, 2024
Abstract
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension
and interpretation capabilities in Autonomous Driving (AD) by incorporating
large language models. Despite the advancements, current data-driven AD
approaches tend to concentrate on a single dataset and specific tasks,
neglecting their overall capabilities and ability to generalize. To bridge
these gaps, we propose DriveMM, a general large multimodal model designed to
process diverse data inputs, such as images and multi-view videos, while
performing a broad spectrum of AD tasks, including perception, prediction, and
planning. Initially, the model undergoes curriculum pre-training to process
varied visual signals and perform basic visual comprehension and perception
tasks. Subsequently, we augment and standardize various AD-related datasets to
fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To
assess the general capabilities and generalization ability, we conduct
evaluations on six public benchmarks and undertake zero-shot transfer on an
unseen dataset, where DriveMM achieves state-of-the-art performance across all
tasks. We hope DriveMM can serve as a promising solution for future end-to-end
autonomous driving applications in the real world. Project page with code:
https://github.com/zhijian11/DriveMM.
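
The abstract describes standardizing various AD-related datasets so that one model can consume images as well as multi-view videos. As a rough illustration of what such a standardization step might look like, the sketch below maps heterogeneous records into a single question-answer schema. It is not the DriveMM implementation: the record layout, the `UnifiedSample` fields, the modality labels, and the function names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UnifiedSample:
    """Hypothetical unified schema; field names are illustrative assumptions."""
    source: str              # name of the originating dataset
    modality: str            # e.g. "image" or "multi_view_video"
    visual_paths: List[str]  # all frame paths, flattened view by view
    question: str
    answer: str


def infer_modality(num_views: int, num_frames: int) -> str:
    """Label the visual input by its number of camera views and frames."""
    if num_views < 1 or num_frames < 1:
        raise ValueError("a sample needs at least one view and one frame")
    prefix = "multi_view_" if num_views > 1 else ""
    kind = "video" if num_frames > 1 else "image"
    return prefix + kind


def standardize(record: Dict, source: str) -> UnifiedSample:
    """Convert one raw record into the unified schema.

    Assumed record layout (hypothetical):
    {"views": [[frame paths of view 0], [frame paths of view 1], ...],
     "question": str, "answer": str}
    """
    views = record["views"]
    num_frames = max((len(frames) for frames in views), default=0)
    modality = infer_modality(len(views), num_frames)
    paths = [path for frames in views for path in frames]
    return UnifiedSample(source, modality, paths,
                         record["question"], record["answer"])


if __name__ == "__main__":
    raw = {
        "views": [["front_0.jpg", "front_1.jpg"], ["back_0.jpg", "back_1.jpg"]],
        "question": "Is it safe to change lanes?",
        "answer": "No, a vehicle is approaching from behind.",
    }
    print(standardize(raw, source="example_dataset"))
```

With every dataset reduced to the same schema, samples from different sources can be mixed in one fine-tuning set, and the modality label tells the data loader how to arrange the visual inputs.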