Video event recognition according to content feature faces great challenges due to complex scenes and blurred actions for surveillance videos. To alleviate these challenges, we propose a spatial-temporal structure of deep Convolutional Neural Networks for video event recognition. By taking advantage of spatial-temporal information, we fine-tune a two-stream Network, then fuse spatial and temporal feature at a convolution layer using a conv fusion method to enforce the consistence of spatial-temporal structure. Based on the two-stream Network and spatial-temporal layer, we obtain a triple-channel structure. We pool the trajectory to the fused convolution layer, as the spatial-temporal channel. At the same time, trajectory-pooling is conducted on one spatial convolution layer and one temporal convolution layer, to form another two channels: spatial channel and temporal channel. To combine the merits of deep feature and hand-crafted feature, we implement trajectory-constrained pooling to HOG and HOF features. Trajectory-pooled HOG and HOF features are concatenated to spatial channel and temporal channel respectively. A fusion method on triple-channel is designed to obtain the final recognition result. The experiments on two surveillance video datasets including VIRAT 1.0 and VIRAT 2.0, which involves a suit of challenging events, such as person loading an object to a vehicle, person opening a vehicle trunk, manifest that the proposed method can achieve superior performance compared with other methods on these event benchmarks.
Previous Article in event
Next Article in event
Next Article in congress
Trajectory-pooled Spatial-temporal Structure of Deep Convolutional Neural Networks for Video Event Recognition
Published:
03 January 2017
by MDPI
in MOL2NET'16, Conference on Molecular, Biomed., Comput. & Network Science and Engineering, 2nd ed.
congress USEDAT-02: USA-Europe Data Analysis Training Program Workshop, Cambridge, UK-Bilbao, Spain-Miami, USA, 2016
Abstract:
Keywords: video event recognition; CNN; spatial-temporal; trajectory-pooled; triple-channel