Trajectory-pooled Spatial-temporal Structure of Deep Convolutional Neural Networks for Video Event Recognition

¹ School of Computer Science and Technology, Soochow University
² College of mathematics physics and information engineering, Jiaxing University
³ School of Computer Science and Engineering, Changshu Institute of Science and Technology
⁴ Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
⁵ Collaborative Innovation Center of Novel Software Technology and Industrialization

Published: 03 January 2017 by MDPI in MOL2NET'16, Conference on Molecular, Biomed., Comput. & Network Science and Engineering, 2nd ed. congress USEDAT-02: USA-Europe Data Analysis Training Program Workshop, Cambridge, UK-Bilbao, Spain-Miami, USA, 2016

https://doi.org/10.3390/mol2net-02-03857

Abstract:

Video event recognition according to content feature faces great challenges due to complex scenes and blurred actions for surveillance videos. To alleviate these challenges, we propose a spatial-temporal structure of deep Convolutional Neural Networks for video event recognition. By taking advantage of spatial-temporal information, we fine-tune a two-stream Network, then fuse spatial and temporal feature at a convolution layer using a conv fusion method to enforce the consistence of spatial-temporal structure. Based on the two-stream Network and spatial-temporal layer, we obtain a triple-channel structure. We pool the trajectory to the fused convolution layer, as the spatial-temporal channel. At the same time, trajectory-pooling is conducted on one spatial convolution layer and one temporal convolution layer, to form another two channels: spatial channel and temporal channel. To combine the merits of deep feature and hand-crafted feature, we implement trajectory-constrained pooling to HOG and HOF features. Trajectory-pooled HOG and HOF features are concatenated to spatial channel and temporal channel respectively. A fusion method on triple-channel is designed to obtain the final recognition result. The experiments on two surveillance video datasets including VIRAT 1.0 and VIRAT 2.0, which involves a suit of challenging events, such as person loading an object to a vehicle, person opening a vehicle trunk, manifest that the proposed method can achieve superior performance compared with other methods on these event benchmarks.

Keywords: video event recognition; CNN; spatial-temporal; trajectory-pooled; triple-channel

View Poster

100 Reads