Low-cost computation for isolated sign language video recognition with multiple reservoir computing.

Journal: PLOS ONE
Published Date:

Abstract

Sign language recognition (SLR) has the potential to bridge communication gaps and empower hearing-impaired communities. To ensure the portability and accessibility of an SLR system, implementation on a portable, server-independent device is essential: it enables use in areas without internet connectivity and protects data privacy. Although deep neural network models are powerful, their efficacy is hindered by computational constraints on edge devices. This study explores reservoir computing (RC), which is renowned for its edge-friendly characteristics. By leveraging RC, we aim to build a cost-effective SLR system optimized for edge devices with limited resources. To enhance the recognition capability of RC, we introduce multiple reservoirs with distinct leak rates, extracting diverse features from the input videos. Before feeding sign language videos into the RC, we preprocess them with MediaPipe: we extract the coordinates of the signer's body and hand locations, referred to as keypoints, and normalize their spatial positions. This combination of keypoint extraction via MediaPipe and normalization during preprocessing makes the SLR system robust to complex backgrounds and varying signer positions. Experimental results demonstrate that the integration of MediaPipe and multiple reservoirs yields results competitive with deep recurrent neural networks and echo state networks while promising significantly lower training times. Our proposed multiple reservoir computing (MRC) achieved top-1, top-5, and top-10 accuracies of 60.35%, 84.65%, and 91.51%, respectively, on the WLASL100 dataset, outperforming the deep learning-based approaches Pose-TGCN and Pose-GRU. Furthermore, owing to the characteristics of RC, training time was reduced to 52.7 s, compared with 20 h for I3D, while inference time remained competitive.
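To make the multiple-reservoir idea concrete (this is an illustrative sketch, not the authors' implementation), a leaky-integrator echo state reservoir updates its state as x(t+1) = (1 − α)·x(t) + α·tanh(W_in·u(t) + W·x(t)), where α is the leak rate; running several reservoirs with distinct α values over the same keypoint sequence and concatenating their states yields features at different timescales for a linear readout. All sizes, rates, and the random input below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

class Reservoir:
    """Leaky-integrator echo state reservoir (illustrative sizes and rates)."""
    def __init__(self, n_in, n_res, leak_rate, spectral_radius=0.9):
        self.alpha = leak_rate
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        # Rescale recurrent weights to the desired spectral radius
        # (a common heuristic for the echo state property).
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
        self.W = W
        self.x = np.zeros(n_res)

    def step(self, u):
        pre = self.W_in @ u + self.W @ self.x
        self.x = (1 - self.alpha) * self.x + self.alpha * np.tanh(pre)
        return self.x

# Three reservoirs with distinct leak rates extract fast-to-slow
# features from the same input sequence.
reservoirs = [Reservoir(n_in=8, n_res=50, leak_rate=a) for a in (0.1, 0.5, 0.9)]

T = 30                                # frames in one (synthetic) video
inputs = rng.standard_normal((T, 8))  # stand-in for normalized keypoints
states = [np.concatenate([r.step(u) for r in reservoirs]) for u in inputs]
features = states[-1]                 # concatenated state fed to a readout
print(features.shape)                 # (150,)
```

In RC, only the linear readout mapping `features` to class labels is trained (e.g., by ridge regression), which is what keeps training time in seconds rather than hours.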

Authors

  • A R Syulistyo
    Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu, Kitakyushu, Japan.
  • Y Tanaka
    Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu, Kitakyushu, Japan.
  • D Pramanta
    Department of Information and Network Sciences, Kyushu Institute of Information Sciences, 6-3-1, Saifu, Dazaifu, Japan.
  • N Fuengfusin
    Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu, Kitakyushu, Japan.
  • H Tamukoh
    Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu, Kitakyushu, Japan.