MS25: Materials Science-Focused Benchmark Data Set for Machine Learning Interatomic Potentials.
Journal:
Journal of Chemical Information and Modeling
Published Date:
Jul 31, 2025
Abstract
We present MS25, a benchmark data set for evaluating machine learning interatomic potentials (MLIPs) across diverse materials-relevant systems, including MgO surfaces, liquid water, zeolites, a catalytic Pt surface reaction, high-entropy alloys (HEAs), and disordered Zr-oxides. Five MLIP architectures (MACE, NequIP, Allegro, MTP, and TorchANI) are trained and tested, focusing not only on traditional metrics (energies, forces, and stresses) but also on explicit validation of derived physical observables such as lattice constants, volumes, and reaction barriers. We find that most models reach comparable accuracy on standard error metrics for the simpler systems, although equivariant MLIPs achieve 1.5-2× lower energy and force errors than nonequivariant MLIPs in structurally complex or compositionally disordered environments such as HEAs and Zr-O systems. Our analysis highlights that low errors in energy and force predictions do not guarantee reliable observables, emphasizing the necessity of explicit validation. We demonstrate limitations in cross-framework transferability: models trained on one zeolite framework (CHA) fail to generalize reliably to structurally distinct frameworks (e.g., MFI). Size-extensivity tests show some dependence on system size for MgO, which results from forced periodicity. The HEA and Zr-O data sets are identified as challenging tests for future benchmarks and MLIP architecture development, as they differentiate strongly between architectures and remain relatively difficult even with 1000 training images. Moving forward, we recommend that benchmarking efforts shift their focus from marginal accuracy improvements in energy and force errors toward identifying and understanding model failure modes, rigorously assessing transferability, and evaluating how energy and force errors propagate to observable predictions. For researchers choosing an MLIP architecture, we suggest equivariant architectures when system complexity is a challenge. For simpler materials problems, auxiliary considerations such as integration with molecular dynamics engines, the trade-off between data set generation cost and MLIP inference speed, and software ecosystem integration may weigh more heavily than small differences in error metrics that are unlikely to matter for production-level research.
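To illustrate the distinction the abstract draws between standard error metrics and derived observables, the sketch below (not from the paper; the file name, volume-scan range, and calculator choice are illustrative assumptions) computes a force MAE over a held-out test set and an MgO lattice constant from an equation-of-state fit, using ASE with any trained MLIP that exposes an ASE calculator (e.g., MACE's). Low values of the first quantity do not automatically imply accuracy of the second, which is the paper's central validation point.

# Minimal sketch, assuming ASE and a test set in extended-XYZ format with
# reference forces stored per frame. Nothing here reproduces the paper's
# actual evaluation pipeline.
import numpy as np
from ase.io import read
from ase.build import bulk
from ase.eos import EquationOfState

def force_mae(frames, calc):
    """Traditional metric: mean absolute force error of an MLIP calculator
    against the reference forces stored in the frames."""
    errs = []
    for ref in frames:
        pred = ref.copy()
        pred.calc = calc
        errs.append(np.abs(pred.get_forces() - ref.get_forces()).ravel())
    return np.concatenate(errs).mean()

def mgo_lattice_constant(calc, scales=np.linspace(0.97, 1.03, 7)):
    """Derived observable: equilibrium lattice constant of rocksalt MgO from an
    equation-of-state fit over a small volume scan (scan range is an assumption)."""
    a0_guess = 4.21  # approximate experimental MgO lattice constant, in Angstrom
    volumes, energies = [], []
    for s in scales:
        atoms = bulk("MgO", "rocksalt", a=a0_guess * s)  # primitive cell, 2 atoms
        atoms.calc = calc
        volumes.append(atoms.get_volume())
        energies.append(atoms.get_potential_energy())
    v0, e0, B = EquationOfState(volumes, energies).fit()
    return (4.0 * v0) ** (1.0 / 3.0)  # conventional cell holds 4 formula units

# Usage with any trained MLIP exposing an ASE calculator (file name is hypothetical):
#   calc = ...  # e.g., a MACE/NequIP/Allegro ASE calculator loaded from a trained model
#   frames = read("mgo_test.extxyz", index=":")
#   print(force_mae(frames, calc), mgo_lattice_constant(calc))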
Authors