
Trajectory-pooled Spatial-temporal Structure of Deep Convolutional Neural Networks for Video Event Recognition

1 School of Computer Science and Technology, Soochow University
2 College of Mathematics, Physics and Information Engineering, Jiaxing University
3 School of Computer Science and Engineering, Changshu Institute of Science and Technology
4 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
5 Collaborative Innovation Center of Novel Software Technology and Industrialization
* Author to whom correspondence should be addressed.
Published: 3 January 2017

Abstract

Video event recognition based on content features faces great challenges in surveillance video due to complex scenes and blurred actions. To alleviate these challenges, we propose a spatial-temporal structure of deep Convolutional Neural Networks for video event recognition. To take advantage of spatial-temporal information, we fine-tune a two-stream network and then fuse the spatial and temporal features at a convolution layer using a conv fusion method, enforcing the consistency of the spatial-temporal structure. Built on the two-stream network and this spatial-temporal layer, we obtain a triple-channel structure. Trajectory-constrained pooling is applied to the fused convolution layer to form the spatial-temporal channel; at the same time, trajectory pooling is conducted on one spatial convolution layer and one temporal convolution layer to form the other two channels, the spatial channel and the temporal channel. To combine the merits of deep features and hand-crafted features, we also apply trajectory-constrained pooling to HOG and HOF features; the trajectory-pooled HOG and HOF features are concatenated to the spatial and temporal channels, respectively. A fusion method over the three channels produces the final recognition result. Experiments on two surveillance video datasets, VIRAT 1.0 and VIRAT 2.0, which involve a suite of challenging events such as a person loading an object into a vehicle or a person opening a vehicle trunk, show that the proposed method achieves superior performance compared with other methods on these event benchmarks.
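To make the trajectory-constrained pooling behind the triple-channel structure concrete, the Python sketch below samples a convolutional feature map along the points of a pre-extracted trajectory and max-pools the sampled activations into one descriptor per trajectory. All names, shapes, the coordinate-mapping scheme, and the choice of max-pooling are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def trajectory_pool(feature_map, trajectories, video_size):
        # feature_map : (T, H, W, C) conv activations for T sampled frames.
        # trajectories: list of [(t, x, y), ...] points in video coordinates.
        # video_size  : (Tv, Hv, Wv) extent of the original video.
        # Hypothetical sketch: sample the map along each trajectory, then pool.
        T, H, W, C = feature_map.shape
        Tv, Hv, Wv = video_size
        descriptors = []
        for traj in trajectories:
            samples = []
            for t, x, y in traj:
                # Map video coordinates onto the coarser feature-map grid.
                ft = min(int(t * T / Tv), T - 1)
                fy = min(int(y * H / Hv), H - 1)
                fx = min(int(x * W / Wv), W - 1)
                samples.append(feature_map[ft, fy, fx])  # C-dim activation vector
            # Max-pool the sampled activations along the trajectory.
            descriptors.append(np.max(samples, axis=0))
        return np.stack(descriptors)  # (num_trajectories, C)

    # Assembling the three channels per trajectory (shapes are assumptions):
    # spatial_conv / temporal_conv are conv maps of the two streams, fused_conv
    # their conv-fused map, and hog_d / hof_d trajectory-pooled HOG/HOF features.
    # spatial_channel  = np.concatenate([trajectory_pool(spatial_conv,  trajs, vs), hog_d], axis=1)
    # temporal_channel = np.concatenate([trajectory_pool(temporal_conv, trajs, vs), hof_d], axis=1)
    # st_channel       = trajectory_pool(fused_conv, trajs, vs)

The per-channel descriptors would then be fed to the paper's triple-channel fusion step; the sketch only illustrates how trajectories constrain where the conv maps are pooled.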

Keywords

video event recognition; CNN; spatial-temporal; trajectory-pooled; triple-channel

Cite this article as

Li, Y.; Wan, X.; Wang, Z.; Gong, S.; Liu, C. Trajectory-pooled Spatial-temporal Structure of Deep Convolutional Neural Networks for Video Event Recognition. In Proceedings of the MOL2NET, International Conference on Multidisciplinary Sciences, 25 December 2016–25 January 2017; Sciforum Electronic Conference Series, Vol. 2, 2016; doi:10.3390/mol2net-02-03857.
