Beyond traditional surveillance applications, sensor-based human action recognition and segmentation responds to a growing demand in the health and safety sector. Among state-of-the-art methods for sensor data analytics, deep learning approaches to human action segmentation fall into two broad categories: algorithms that take skeletal sequences as input, and video-based models with RGB, depth, and infrared inputs. Recently, skeletal action recognition has largely been dominated by spatio-temporal graph convolutional neural networks (ST-GCN), while video-based action segmentation has achieved strong performance using 3D convolutional neural networks (3D-CNN) and vision transformers. In this paper, we argue that these two input types are complementary, and develop a multi-modal ensemble approach that achieves superior performance. Video action segmentation models typically compute features in an offline phase because of the memory constraints inherent to 3D-CNNs; graph CNNs, however, do not suffer from this problem. Hence, a multi-task GCN is developed that predicts both frame-wise actions and sequence-wise action timestamps, which allows fine-tuned video classification models to classify the resulting action segments and refine the predictions. Symmetrically, a multi-task video approach is presented that uses a video action segmentation model to predict frame-wise labels and timestamps, augmented with a skeletal action classification model, yielding improved performance. Finally, an ensemble of segmentation methods for each modality (skeletal, RGB, depth, and infrared) is formulated. Experimental results yield 86% accuracy on the PKU-MMD v2 dataset, representing state-of-the-art performance while also addressing the related over-segmentation problem.
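The abstract does not state how the per-modality ensemble is formulated, so the following is only a minimal sketch under stated assumptions: each modality is assumed to output per-frame class probabilities, the ensemble is assumed to be an unweighted average, and the function names (`ensemble_framewise`, `framewise_labels`, `labels_to_segments`) are hypothetical. It illustrates how frame-wise predictions can be fused and then converted into action segments with start and end frames.

```python
from itertools import groupby


def ensemble_framewise(prob_streams):
    """Average per-frame class probabilities across modalities.

    prob_streams: list of streams, each a list of per-frame probability lists.
    Assumption: unweighted averaging; the paper may use a different scheme.
    """
    n_frames = len(prob_streams[0])
    n_classes = len(prob_streams[0][0])
    averaged = []
    for t in range(n_frames):
        averaged.append([
            sum(stream[t][c] for stream in prob_streams) / len(prob_streams)
            for c in range(n_classes)
        ])
    return averaged


def framewise_labels(probs):
    """Pick the most probable class index for every frame."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


def labels_to_segments(labels):
    """Collapse runs of identical labels into (label, start, end) tuples.

    `end` is exclusive, so a segment covers frames start .. end - 1.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


# Toy data (made up for illustration): two modalities, four frames, two classes.
skeleton = [[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.2, 0.8]]
rgb = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.3, 0.7]]

probs = ensemble_framewise([skeleton, rgb])
labels = framewise_labels(probs)
segments = labels_to_segments(labels)
```

In this toy case the fused labels split the sequence into two segments. Segment boundaries of this kind are what a segment-level classifier from another modality could then re-label, in the spirit of the refinement step described above.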
Multi Modal Human Action Segmentation using Skeletal Video Ensembles
Published: 15 November 2023 by MDPI in the 10th International Electronic Conference on Sensors and Applications, session Sensors and Artificial Intelligence
Keywords: Action Segmentation; Deep Learning; Computer Vision; Graph Neural Networks; Convolutional Neural Networks