Multi Modal Human Action Segmentation using Skeletal Video Ensembles
* 1 , 2
1  PhD Student at the University of Ottawa
2  Professor at University of Ottawa in the department of Electrical Engineering and Computer Science
Academic Editor: Stefano Mariani


Beyond traditional surveillance applications, sensor-based human action recognition and segmentation responds to a growing demand in the health and safety sector. Among state-of-the-art methods for sensor data analytics, deep learning approaches to human action segmentation fall into two broad categories: algorithms that take skeletal sequences as input, and video-based models that take RGB, depth, and infrared inputs. Recently, skeletal action recognition has largely been dominated by spatio-temporal graph convolutional neural networks (ST-GCN), while video-based action segmentation has achieved strong performance with 3D convolutional neural networks (3D-CNN) and vision transformers. In this paper, we argue that these two input types are complementary, and we develop a multi-modal ensemble that outperforms either one alone. Video action segmentation models typically compute features in an offline phase because of the memory constraints inherent to 3D-CNNs; graph CNNs do not suffer from this problem. Hence, a multi-task GCN is developed that predicts both frame-wise actions and sequence-wise action timestamps, which allows fine-tuned video classification models to classify the resulting action segments and refine the predictions. Symmetrically, a multi-task video approach is presented in which a video action segmentation model predicts frame-wise labels and timestamps and is augmented with a skeletal action classification model, yielding improved performance. Finally, an ensemble of segmentation methods, one for each modality (skeletal, RGB, depth, and infrared), is formulated. Experimental results yield 86% accuracy on the PKU-MMD v2 dataset, representing state-of-the-art performance while also addressing the related over-segmentation problem.
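As a rough illustration of the ensemble idea, the sketch below fuses frame-wise class probabilities from several modalities and then collapses the fused labels into action segments with start and end frames. It is not the paper's method: the uniform averaging rule, the function names `fuse_framewise` and `labels_to_segments`, and the toy probabilities are all assumptions made for this example, since the abstract does not specify how the per-modality predictions are combined.

```python
from typing import Dict, List, Tuple


def fuse_framewise(probs_by_modality: Dict[str, List[List[float]]]) -> List[int]:
    """Average per-frame class probabilities across modalities, then take the argmax.

    Assumption: uniform averaging (late fusion); the paper may use another rule.
    Every modality must provide the same number of frames and classes.
    """
    streams = list(probs_by_modality.values())
    num_frames = len(streams[0])
    labels = []
    for t in range(num_frames):
        num_classes = len(streams[0][t])
        mean = [
            sum(stream[t][c] for stream in streams) / len(streams)
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=mean.__getitem__))
    return labels


def labels_to_segments(labels: List[int]) -> List[Tuple[int, int, int]]:
    """Collapse frame-wise labels into (label, start_frame, end_frame_exclusive)."""
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current segment at the end of the sequence or on a label change.
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((labels[start], start, t))
            start = t
    return segments


# Toy inputs (hypothetical): 4 frames, 2 action classes, 2 modalities.
toy_probs = {
    "skeleton": [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]],
    "rgb": [[0.8, 0.2], [0.3, 0.7], [0.3, 0.7], [0.2, 0.8]],
}
fused_labels = fuse_framewise(toy_probs)
fused_segments = labels_to_segments(fused_labels)
```

In this toy case the skeleton stream alone would assign frame 1 to class 0, while the averaged prediction assigns it to class 1, which shows how a second modality can shift a segment boundary. The segment tuples play the role of the action timestamps that a separate classifier could then relabel.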

Keywords: Action Segmentation; Deep Learning; Computer Vision; Graph Neural Networks; Convolutional Neural Networks