Exploring the Use of Contrastive Language-Image Pre-Training for Human Posture Classification: Insights from Yoga Pose Analysis
Journal: arXiv
Published Date: Jan 13, 2025
Abstract
Accurate human posture classification in images and videos is crucial for
automated applications across various fields, including work safety, physical
rehabilitation, sports training, and daily assisted living. Recently, multimodal
learning methods, such as Contrastive Language-Image Pretraining (CLIP), have
advanced significantly in jointly understanding images and text. This study
aims to assess the effectiveness of CLIP in classifying human postures,
focusing on its application in yoga. Despite the initial limitations of the
zero-shot approach, applying transfer learning on 15,301 images (real and
synthetic) with 82 classes has shown promising results. The article describes
the full fine-tuning procedure, including the choice of image description
syntax and model, and the adjustment of hyperparameters. The fine-tuned CLIP
model, tested on 3,826 images, achieves an accuracy of over 85%, surpassing the
previous state of the art on the same dataset by approximately 6%, while
requiring 3.5 times less training time than fine-tuning a YOLOv8-based
model. For more application-oriented scenarios, with smaller
datasets of six postures each, containing 1,301 and 401 training images, the
fine-tuned models attain an accuracy of 98.8% and 99.1%, respectively.
Furthermore, our experiments indicate that training with as few as 20 images
per pose can yield around 90% accuracy in a six-class dataset. This study
demonstrates that this multimodal technique can be effectively used for yoga
pose classification and possibly for human posture classification in general.
Additionally, the CLIP inference time (around 7 ms) suggests that the model can
be integrated into automated systems for posture evaluation, e.g., a real-time
personal yoga assistant for performance assessment.
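
As an illustration of the zero-shot setting the abstract refers to, the following minimal sketch shows how a CLIP-style classifier could score an image against one text description per pose: cosine similarity between the image embedding and each text embedding, scaled and passed through a softmax. It is a sketch under stated assumptions, not the authors' implementation. The prompt template, the function names, the logit scale of 100, and the three-dimensional toy embeddings are all hypothetical stand-ins; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def build_prompts(pose_names, template="a photo of a person doing the {} yoga pose"):
    # Hypothetical description syntax; the abstract does not state the wording used.
    return [template.format(name) for name in pose_names]


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    # Scaled cosine similarities act as logits (logit_scale is an assumed value).
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    # Numerically stable softmax over the candidate descriptions.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


poses = ["tree", "cobra", "warrior"]
prompts = build_prompts(poses)

# Toy embeddings standing in for encoder outputs (assumed values, not real data).
toy_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_image_emb = [0.1, 0.9, 0.2]

predicted, probs = zero_shot_classify(toy_image_emb, toy_text_embs, poses)
print(predicted)  # cobra
```

Fine-tuning, as described in the abstract, would adjust the encoders so that each image embedding moves closer to the embedding of its correct description; the scoring step at inference time keeps this form.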