EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

Journal: arXiv

Published Date: May 14, 2025

Abstract

Recent advances in creative AI have enabled the synthesis of high-fidelity images and videos conditioned on language instructions. Building on these developments, text-to-video diffusion models have evolved into embodied world models (EWMs) capable of generating physically plausible scenes from language commands, effectively bridging vision and action in embodied AI applications. This work addresses the critical challenge of evaluating EWMs beyond general perceptual metrics to ensure the generation of physically grounded and action-consistent behaviors. We propose the Embodied World Model Benchmark (EWMBench), a dedicated framework designed to evaluate EWMs based on three key aspects: visual scene consistency, motion correctness, and semantic alignment. Our approach leverages a meticulously curated dataset encompassing diverse scenes and motion patterns, alongside a comprehensive multi-dimensional evaluation toolkit, to assess and compare candidate models. The proposed benchmark not only identifies the limitations of existing video generation models in meeting the unique requirements of embodied tasks but also provides valuable insights to guide future advancements in the field. The dataset and evaluation tools are publicly available at https://github.com/AgibotTech/EWMBench.

Authors

Hu Yue
Siyuan Huang
Yue Liao
Shengcong Chen
Pengfei Zhou
Liliang Chen
Maoqing Yao
Guanghui Ren

External Resources

View on arXiv arXiv (http://arxiv.org/abs/2505.09694v2)

EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

Abstract

Authors

Categories

External Resources

Popular Topics

Recent Journals

EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

Abstract

Authors

Categories

External Resources

Stay Ahead of Medical AI

Popular Topics

Recent Journals