Trajectory-pooled Spatial-temporal Structure of Deep Convolutional Neural Networks for Video Event Recognition
1, 2 , 1 , 1 , * 1, 3 , * 1, 4, 5
1  School of Computer Science and Technology, Soochow University
2  College of Mathematics, Physics and Information Engineering, Jiaxing University
3  School of Computer Science and Engineering, Changshu Institute of Science and Technology
4  Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
5  Collaborative Innovation Center of Novel Software Technology and Industrialization


Recognizing video events from content features is challenging for surveillance videos because of complex scenes and blurred actions. To address these challenges, we propose a spatial-temporal structure of deep Convolutional Neural Networks (CNNs) for video event recognition. To exploit spatial-temporal information, we fine-tune a two-stream network and then fuse the spatial and temporal features at a convolution layer using a conv fusion method, which enforces the consistency of the spatial-temporal structure. Building on the two-stream network and the fused spatial-temporal layer, we obtain a triple-channel structure. Trajectory pooling on the fused convolution layer forms the spatial-temporal channel. In parallel, trajectory pooling on one spatial convolution layer and one temporal convolution layer forms two further channels: the spatial channel and the temporal channel. To combine the merits of deep and hand-crafted features, we apply trajectory-constrained pooling to HOG and HOF features; the trajectory-pooled HOG and HOF features are concatenated to the spatial channel and the temporal channel, respectively. A fusion method over the three channels produces the final recognition result. Experiments on two surveillance video datasets, VIRAT 1.0 and VIRAT 2.0, which involve a suite of challenging events such as a person loading an object to a vehicle and a person opening a vehicle trunk, show that the proposed method achieves superior performance compared with other methods on these event benchmarks.

Keywords: video event recognition; CNN; spatial-temporal; trajectory-pooled; triple-channel
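
The two core operations named in the abstract, conv fusion of the spatial and temporal feature maps and trajectory pooling over a convolution layer, can be illustrated with a minimal NumPy sketch. The abstract does not give the layer choice, filter sizes, or pooling rule, so everything concrete here is an assumption made for illustration: the function names `conv_fusion` and `trajectory_pool`, the `(T, H, W, C)` feature-map layout, the use of a 1x1 filter bank for fusion, sum-pooling along the trajectory, and the `scale` ratio that maps frame coordinates to feature-map coordinates.

```python
import numpy as np


def conv_fusion(spatial, temporal, weights, bias):
    """Fuse two feature maps by channel stacking followed by a 1x1 convolution.

    spatial, temporal: arrays of shape (T, H, W, C) (assumed layout).
    weights: array of shape (2C, C_out), the 1x1 filter bank (assumed form).
    bias: array of shape (C_out,).
    """
    # Stack the two streams along the channel axis: (T, H, W, 2C).
    stacked = np.concatenate([spatial, temporal], axis=-1)
    # A 1x1 convolution is a per-position linear map over channels.
    return stacked @ weights + bias


def trajectory_pool(feature_map, trajectory, scale):
    """Sum-pool the feature vectors that lie along one trajectory.

    feature_map: array of shape (T, H, W, C).
    trajectory: iterable of (t, y, x) points in video-frame coordinates.
    scale: (st, sy, sx) ratios from frame coordinates to feature-map
           coordinates (hypothetical parameter; depends on the network strides).
    """
    T, H, W, C = feature_map.shape
    pooled = np.zeros(C, dtype=np.float64)
    for t, y, x in trajectory:
        # Map each point onto the feature-map grid and clamp to valid indices.
        ti = min(max(int(round(t * scale[0])), 0), T - 1)
        yi = min(max(int(round(y * scale[1])), 0), H - 1)
        xi = min(max(int(round(x * scale[2])), 0), W - 1)
        pooled += feature_map[ti, yi, xi]
    return pooled
```

Under these assumptions, calling `trajectory_pool` on the output of `conv_fusion` would yield one descriptor per trajectory for the spatial-temporal channel, while calling it directly on a spatial or temporal feature map would yield the descriptors for the other two channels.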