Events 10th International Electronic Conference on Sensors and Applications

Published

This submission belongs to the session E. Sensors and Artificial Intelligence of the event 10th International Electronic Conference on Sensors and Applications

Published date

15 Nov, 2023

Academic Editor

Stefano Mariani

Citation

James Dickens, Pierre Payeur, Multi Modal Human Action Segmentation using Skeletal Video Ensembles, in Proceedings of the 10th International Electronic Conference on Sensors and Applications, 15 November–30 November 2023, MDPI: Basel, Switzerland, doi: 10.3390/ecsa-10-16257

Multi Modal Human Action Segmentation using Skeletal Video Ensembles

James Dickens ¹

Pierre Payeur ²

1. PhD Student at the University of Ottawa

2. Professor at the University of Ottawa, Department of Electrical Engineering and Computer Science, Canada

Abstract

Beyond traditional surveillance applications, sensor-based human action recognition and segmentation responds to a growing demand in the health and safety sector. Among state-of-the-art methods for sensor data analytics, deep learning approaches to human action segmentation fall into two broad categories: algorithms that take skeletal sequences as input, and video-based models that take RGB, depth, and infrared inputs. Recently, skeletal action recognition has largely been dominated by spatio-temporal graph convolutional neural networks (ST-GCN), while video-based action segmentation has achieved strong performance with 3D convolutional neural networks (3D-CNN) and vision transformers. In this paper, we argue that these two kinds of input are complementary, and develop a multi-modal ensemble approach that achieves superior performance. Video action segmentation models typically compute features in an offline phase because of the memory constraints inherent to 3D-CNNs; graph CNNs do not suffer from this problem. Hence, a multi-task GCN is developed that predicts both frame-wise actions and sequence-wise action timestamps, so that fine-tuned video classification models can classify the resulting action segments and refine the predictions. Symmetrically, a multi-task video approach is presented that uses a video action segmentation model to predict frame-wise labels and timestamps, augmented with a skeletal action classification model, yielding improved performance. Finally, an ensemble of segmentation methods, one per modality (skeletal, RGB, depth, and infrared), is formulated. Experimental results show 86% accuracy on the PKU-MMD v2 dataset, representing state-of-the-art performance, while also addressing the related over-segmentation problem.
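
The general idea of fusing frame-wise predictions across modalities and then refining them with predicted action timestamps can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform averaging of class probabilities, and the majority-vote fallback are all assumptions made for illustration, and the optional `segment_classifier` callback merely stands in for a fine-tuned classification model.

```python
from collections import Counter

def ensemble_framewise(prob_streams, weights=None):
    """Weighted average of per-frame class probabilities over modalities.

    prob_streams: one list per modality, each [num_frames][num_classes].
    Uniform weights are an assumption, not taken from the paper.
    """
    n = len(prob_streams)
    weights = weights or [1.0 / n] * n
    num_frames = len(prob_streams[0])
    num_classes = len(prob_streams[0][0])
    return [
        [sum(w * s[t][c] for w, s in zip(weights, prob_streams))
         for c in range(num_classes)]
        for t in range(num_frames)
    ]

def argmax_labels(fused):
    """Pick the most probable class index for every frame."""
    return [max(range(len(p)), key=p.__getitem__) for p in fused]

def refine_with_boundaries(labels, boundaries, segment_classifier=None):
    """Give every segment between predicted boundaries a single label.

    segment_classifier(start, end) is a hypothetical hook for a separate
    classification model; without it, a majority vote over the frame-wise
    labels inside the segment is used.
    """
    inner = sorted(b for b in set(boundaries) if 0 < b < len(labels))
    cuts = [0] + inner + [len(labels)]
    refined = []
    for start, end in zip(cuts, cuts[1:]):
        if segment_classifier is not None:
            label = segment_classifier(start, end)
        else:
            label = Counter(labels[start:end]).most_common(1)[0][0]
        refined.extend([label] * (end - start))
    return refined

# Toy data: two modalities, six frames, two action classes (made up).
skeleton = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.6, 0.4]]
rgb = [[0.8, 0.2], [0.5, 0.5], [0.7, 0.3], [0.3, 0.7], [0.2, 0.8], [0.3, 0.7]]

frame_labels = argmax_labels(ensemble_framewise([skeleton, rgb]))
refined_labels = refine_with_boundaries(frame_labels, boundaries=[3])
```

In this toy case the fused frame-wise labels contain a one-frame flicker inside the first action; relabelling each segment between boundaries removes it, which is the kind of over-segmentation error that segment-level refinement is meant to suppress.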

Keywords

Action Segmentation

Deep Learning

Computer Vision

Graph Neural Networks

Convolutional Neural Networks

Manuscript

sciforum-078212-done.pdf
