SITE: towards Spatial Intelligence Thorough Evaluation
Journal: arXiv
Published Date: May 8, 2025
Abstract
Spatial intelligence (SI) is a cognitive ability encompassing the
visualization and manipulation of, and reasoning about, spatial relationships,
underpinning disciplines from neuroscience to robotics. We introduce SITE, a
benchmark dataset towards SI Thorough Evaluation in a standardized format of
multiple-choice visual question answering, designed to assess large
vision-language models' spatial intelligence across diverse visual modalities
(single-image, multi-image, and video) and SI factors (figural to environmental
scales, spatial visualization and orientation, intrinsic and extrinsic, static
and dynamic). Our approach to curating the benchmark combines a bottom-up
survey of 31 existing datasets and a top-down strategy drawing upon three
classification systems in cognitive science, which prompted us to design two
novel types of tasks about view-taking and dynamic scenes. Extensive
experiments reveal that leading models lag behind human experts, especially in
spatial orientation, a fundamental SI factor. Moreover, we demonstrate a
positive correlation between a model's spatial reasoning proficiency and its
performance on an embodied AI task.
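
As an illustration of the standardized multiple-choice format described above, the following is a minimal sketch of how answers might be scored per visual modality or per SI factor. It is not the benchmark's actual evaluation code: the `Item` fields, the `accuracy_by` helper, and the exact-match rule on option letters are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema, not SITE's own)."""
    question: str
    options: list[str]
    answer: str    # gold option letter, e.g. "B"
    modality: str  # "single-image", "multi-image", or "video"
    factor: str    # assumed SI-factor label, e.g. "orientation"


def accuracy_by(items, predictions, key):
    """Return accuracy per group, where the group is the Item field `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        group = getattr(item, key)
        total[group] += 1
        # Assumed scoring rule: case-insensitive match on the option letter.
        correct[group] += int(pred.strip().upper() == item.answer)
    return {group: correct[group] / total[group] for group in total}
```

Calling `accuracy_by(items, predictions, "factor")` would group the same predictions by SI factor instead of modality, which is the kind of breakdown needed to compare, for example, spatial orientation against spatial visualization.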