Real-time multiple spatiotemporal action localization and prediction approach using deep learning.

Journal: Neural networks : the official journal of the International Neural Network Society
Published Date:

Abstract

Detecting the locations of multiple actions in videos and classifying them in real time is a challenging task known as the "action localization and prediction" problem. Convolutional neural networks (ConvNets) have achieved great success in localization and prediction tasks on still images. A major advance came with the introduction of the AlexNet architecture in the ImageNet competition; ConvNets have since achieved state-of-the-art performance across a wide variety of machine-vision tasks, including object detection, image segmentation, image classification, facial recognition, human pose estimation, and tracking. However, few works address action localization and prediction in videos. Current action localization research focuses primarily on classifying temporally trimmed videos in which only one action occurs per frame. Moreover, nearly all current approaches work only offline and are too slow to be useful in real-world environments. In this work, we propose a fast and accurate deep-learning approach that performs real-time action localization and prediction. The approach uses convolutional neural networks to localize multiple actions and predict their classes in real time. It first applies appearance and motion detection networks ("you only look once" (YOLO) networks) in a two-stream model to localize and classify actions from RGB frames and optical-flow frames. We then propose a fusion step that increases the localization accuracy of the approach. Finally, we generate action tubes from the frame-level detections. This frame-by-frame processing enables early action detection and prediction with top performance in terms of detection speed and precision.
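The two-stream fusion step summarized above can be sketched as follows. This is a minimal illustration only: the box format, the IoU matching threshold, and the score-averaging rule are assumptions made for demonstration, not the authors' exact fusion method.

```python
# Hypothetical sketch of fusing per-frame detections from the RGB
# (appearance) and optical-flow (motion) YOLO streams. A detection is
# (box, score, label) with box = (x1, y1, x2, y2). All thresholds and
# the averaging rule are illustrative assumptions.

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def fuse_streams(appearance_dets, motion_dets, iou_thresh=0.5):
    """Merge one frame's detections from both streams. When the streams
    agree on a region and class, the fused score is the mean of the two
    stream scores; unmatched detections from either stream are kept."""
    fused, matched = [], set()
    for box_a, score_a, label_a in appearance_dets:
        best, best_iou = None, iou_thresh
        for i, (box_m, score_m, label_m) in enumerate(motion_dets):
            if label_m == label_a and i not in matched:
                ov = iou(box_a, box_m)
                if ov >= best_iou:
                    best, best_iou = i, ov
        if best is not None:
            matched.add(best)
            fused.append((box_a, (score_a + motion_dets[best][1]) / 2, label_a))
        else:
            fused.append((box_a, score_a, label_a))
    for i, det in enumerate(motion_dets):
        if i not in matched:
            fused.append(det)
    return fused
```

A design note on this sketch: keeping unmatched detections from both streams preserves actions visible in only one modality (e.g. motion cues without distinctive appearance), while the averaged score rewards cross-stream agreement.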
The experimental results demonstrate the superiority of the proposed approach, in terms of both processing time and accuracy, over recent offline and online action localization and prediction approaches on the challenging UCF-101-24 and J-HMDB-21 benchmarks.
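The abstract also mentions generating action tubes from frame-level detections. One common way to do this, sketched below under stated assumptions, is to greedily link same-class detections across consecutive frames by bounding-box overlap; the linking rule and threshold here are illustrative, not the authors' exact tube-construction method.

```python
# Hypothetical sketch of linking per-frame detections into action tubes.
# frames: list of per-frame detection lists; each detection is
# (box, score, label) with box = (x1, y1, x2, y2). The greedy IoU-based
# linking and the 0.3 threshold are assumptions for illustration.

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def link_tubes(frames, iou_thresh=0.3):
    """Greedily extend tubes frame by frame: a tube is continued by the
    unused same-class detection with the highest IoU to its last box;
    unmatched detections start new tubes. Returns a list of tubes, each
    a list of (frame_index, detection)."""
    tubes = []
    for t, dets in enumerate(frames):
        used = set()
        for tube in tubes:
            last_t, (last_box, _, last_label) = tube[-1]
            if last_t != t - 1:
                continue  # tube was not continued last frame; it has ended
            best, best_iou = None, iou_thresh
            for i, (box, score, label) in enumerate(dets):
                if i in used or label != last_label:
                    continue
                ov = iou(last_box, box)
                if ov >= best_iou:
                    best, best_iou = i, ov
            if best is not None:
                used.add(best)
                tube.append((t, dets[best]))
        for i, det in enumerate(dets):
            if i not in used:
                tubes.append([(t, det)])
    return tubes
```

Because tubes are extended incrementally as each frame arrives, this kind of linking supports the early, online prediction setting the abstract describes: a tube's class and extent are available before the action has finished.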

Authors

  • Ahmed Ali Hammam
    Faculty of Computers and Artificial Intelligence, Cairo University, Egypt; Member of Scientific Research Group in Egypt (SRGE), Egypt.
  • Mona M Soliman
    Department of Botany and Microbiology, Faculty of Science, Cairo University, Giza, 12613, Egypt. monam6164@gmail.com.
  • Aboul Ella Hassanien
    Faculty of Computers and Information - Cairo University, Egypt. Electronic address: aboitcairo@gmail.com.