3DBench: A scalable benchmark for object and scene-level instruction-tuning of 3D large language models.

Journal: Neural networks : the official journal of the International Neural Network Society
Published Date:

Abstract

Recent evaluations of Multi-Modal Large Language Models (MLLMs) have been extensive; however, a fine-grained benchmark that pairs point cloud data with language remains absent, leading to superficial comparisons that obscure progress on the nuanced capabilities of such models. Existing benchmarks typically cover object-level classification, scene-level captioning, and visual grounding (VG). These tasks neither adequately capture the spatial perception and logical reasoning abilities of MLLMs nor permit a fair, comprehensive comparison of MLLMs with different architectures. To address these gaps, we propose 3DBench, a fine-grained benchmark designed specifically for MLLMs. It comprises ten tasks spanning the object and scene levels, organized into three evaluation categories: expression, perception, and reasoning. Additionally, we present a scalable pipeline for constructing 3D instruction-tuning datasets from simulation environments, yielding over 239k question-answer pairs covering twelve tasks, each paired with its corresponding point clouds. Trained on this high-quality dataset, we introduce the Bench-model, which integrates advanced detection models to substantially improve MLLM performance. We compare the Bench-model against open-source 3D LLMs, analyzing the impact of different model architectures, training protocols, and public datasets. These experiments highlight the limitations of existing research and suggest promising directions for future investigation. Code and datasets are available at https://github.com/Inshsang/3DBench.
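
As a point of reference for readers, the sketch below illustrates what one instruction-tuning record pairing a question-answer item with a point cloud might look like. The field names, task label, and file layout here are assumptions for illustration only; the actual 3DBench format is defined in the project's GitHub repository linked above.

```python
# Illustrative sketch only: field names and file layout are assumed,
# not the actual 3DBench schema (see https://github.com/Inshsang/3DBench).
import json
import numpy as np

# Hypothetical instruction-tuning record: a QA pair plus a reference to its point cloud.
record = {
    "task": "scene_captioning",              # assumed task label
    "point_cloud": "scenes/scene_0001.npy",  # assumed path to an (N, 6) array: xyz + rgb
    "question": "Describe the objects in this room and how they are arranged.",
    "answer": "A sofa faces a low table near the window, with two chairs to its left.",
}

def load_example(rec):
    """Load the point cloud referenced by a record and return it with the QA pair."""
    points = np.load(rec["point_cloud"])     # shape (N, 6), assumed x, y, z, r, g, b
    return points, rec["question"], rec["answer"]

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
```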

Authors

  • Tianci Hu
    Key Laboratory of Specialty Fiber Optics and Optical Access Networks, Joint International Research Laboratory of Specialty Fiber Optics and Advanced Communication, Shanghai Institute of Advanced Communication and Data Science, Shanghai University, Shanghai, 200400, China. Electronic address: yinsh@shu.edu.cn.
  • Junjie Zhang
    Department of Immunology, School of Basic Medical Sciences, Anhui Medical University, Hefei, 230032, PR China.
  • Yutao Rao
    Key Laboratory of Specialty Fiber Optics and Optical Access Networks, Joint International Research Laboratory of Specialty Fiber Optics and Advanced Communication, Shanghai Institute of Advanced Communication and Data Science, Shanghai University, Shanghai, 200400, China. Electronic address: ryt756227875@shu.edu.cn.
  • Dan Zeng
    Key Laboratory of Specialty Fiber Optics and Optical Access Networks, Joint International Research Laboratory of Specialty Fiber Optics and Advanced Communication, Shanghai Institute of Advanced Communication and Data Science, Shanghai University, Shanghai, 200400, China. Electronic address: dzeng@shu.edu.cn.
  • Hongwen Yu
    Key Laboratory of Specialty Fiber Optics and Optical Access Networks, Joint International Research Laboratory of Specialty Fiber Optics and Advanced Communication, Shanghai Institute of Advanced Communication and Data Science, Shanghai University, Shanghai, 200400, China. Electronic address: hw_yu@shu.edu.cn.
  • Xiaoshui Huang
    School of Public Health, Shanghai Jiao Tong University, China.