Stress-Testing Multimodal Foundation Models for Crystallographic Reasoning
Journal:
arXiv
Published Date:
Jun 16, 2025
Abstract
Evaluating foundation models for crystallographic reasoning requires
benchmarks that isolate generalization behavior while enforcing physical
constraints. This work introduces a multiscale multicrystal dataset with two
physically grounded evaluation protocols to stress-test multimodal generative
models. The Spatial-Exclusion benchmark withholds all supercells of a given
radius from a diverse dataset, enabling controlled assessments of spatial
interpolation and extrapolation. The Compositional-Exclusion benchmark omits
all samples of a specific chemical composition, probing generalization across
stoichiometries. Nine vision--language foundation models are prompted with
crystallographic images and textual context to generate structural annotations.
Responses are evaluated via (i) relative errors in lattice parameters and
density, (ii) a physics-consistency index penalizing volumetric violations, and
(iii) a hallucination score capturing geometric outliers and invalid
space-group predictions. These benchmarks establish a reproducible, physically
informed framework for assessing generalization, consistency, and reliability
in large-scale multimodal models. Dataset and code are available at
https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.