How Much Do Large Language Models Know about Human Motion? A Case Study in 3D Avatar Control
Journal:
arXiv
Published Date:
May 23, 2025
Abstract
We explore Large Language Models (LLMs)' human motion knowledge through 3D
avatar control. Given a motion instruction, we prompt LLMs to first generate a
high-level movement plan with consecutive steps (High-level Planning), then
specify body part positions in each step (Low-level Planning), which we
linearly interpolate into avatar animations as a clear verification lens for
human evaluators. Through carefully designed 20 representative motion
instructions with full coverage of basic movement primitives and balanced body
part usage, we conduct comprehensive evaluations including human assessment of
both generated animations and high-level movement plans, as well as automatic
comparison with oracle positions in low-level planning. We find that LLMs are
strong at interpreting the high-level body movements but struggle with precise
body part positioning. While breaking down motion queries into atomic
components improves planning performance, LLMs have difficulty with multi-step
movements involving high-degree-of-freedom body parts. Furthermore, LLMs
provide reasonable approximation for general spatial descriptions, but fail to
handle precise spatial specifications in text, and the precise spatial-temporal
parameters needed for avatar control. Notably, LLMs show promise in
conceptualizing creative motions and distinguishing culturally-specific motion
patterns.