
Trajectory-pooled Spatial-temporal Structure of Deep Convolutional Neural Networks for Video Event Recognition
Yonggang Li 1, Xiaoyi Wan 2, Zhaohui Wang 2, Shengrong Gong 3, Chunping Liu 4

1  School of Computer Science and Technology, Soochow University; College of Mathematics, Physics and Information Engineering, Jiaxing University
2  School of Computer Science and Technology, Soochow University
3  School of Computer Science and Technology, Soochow University; School of Computer Science and Engineering, Changshu Institute of Science and Technology
4  School of Computer Science and Technology, Soochow University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University; Collaborative Innovation Center of Novel Software Technology and Industrialization

Published: 03 January 2017 by MDPI AG in Proceedings of MOL2NET 2016, International Conference on Multidisciplinary Sciences, 2nd edition
MDPI AG, doi:10.3390/mol2net-02-03857
Abstract:

Video event recognition based on content features faces great challenges in surveillance video due to complex scenes and blurred actions. To alleviate these challenges, we propose a spatial-temporal structure of deep convolutional neural networks for video event recognition. To exploit spatial-temporal information, we fine-tune a two-stream network and then fuse the spatial and temporal features at a convolutional layer with a conv-fusion method, enforcing the consistency of the spatial-temporal structure. From the two-stream network and the fused spatial-temporal layer we obtain a triple-channel structure: trajectory-constrained pooling applied to the fused convolutional layer forms the spatial-temporal channel, while trajectory pooling conducted on one spatial convolutional layer and one temporal convolutional layer forms the other two channels, the spatial channel and the temporal channel. To combine the merits of deep features and hand-crafted features, we also apply trajectory-constrained pooling to HOG and HOF features, and the trajectory-pooled HOG and HOF features are concatenated to the spatial and temporal channels, respectively. A fusion method over the three channels yields the final recognition result. Experiments on two surveillance video datasets, VIRAT 1.0 and VIRAT 2.0, which involve a suite of challenging events such as a person loading an object into a vehicle and a person opening a vehicle trunk, show that the proposed method achieves superior performance compared with other methods on these event benchmarks.
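
To make the architecture concrete, the following minimal Python sketch (using PyTorch) illustrates the two core operations described in the abstract: conv fusion of the spatial and temporal streams, and trajectory-constrained pooling of a convolutional feature map. All class and function names, layer sizes, and the 1x1-convolution fusion variant are illustrative assumptions for this sketch, not the authors' released code.

    # Hypothetical sketch of conv fusion and trajectory-constrained pooling;
    # names and dimensions are illustrative, not from the paper's code.
    import torch
    import torch.nn as nn

    class SpatialTemporalFusion(nn.Module):
        """Conv fusion: stack spatial and temporal feature maps along the
        channel axis, then mix them with a 1x1 convolution so corresponding
        positions in the two streams are combined consistently."""
        def __init__(self, channels: int):
            super().__init__()
            # 1x1 conv maps the 2*C stacked channels back to C fused channels.
            self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, spatial_map: torch.Tensor,
                    temporal_map: torch.Tensor) -> torch.Tensor:
            # spatial_map, temporal_map: (N, C, H, W) conv-layer activations
            stacked = torch.cat([spatial_map, temporal_map], dim=1)  # (N, 2C, H, W)
            return self.fuse(stacked)                                # (N, C, H, W)

    def trajectory_pool(feature_map: torch.Tensor,
                        trajectories: torch.Tensor) -> torch.Tensor:
        """Trajectory-constrained pooling: read the feature map only at the
        points visited by each trajectory, then average over the track to
        obtain one descriptor per trajectory.

        feature_map:  (C, H, W) activations from one convolutional layer
        trajectories: (T, L, 2) integer (y, x) coordinates, T tracks of
                      length L, rescaled to the feature-map resolution
        returns:      (T, C) pooled descriptors
        """
        c, h, w = feature_map.shape
        ys = trajectories[..., 0].clamp(0, h - 1)   # (T, L)
        xs = trajectories[..., 1].clamp(0, w - 1)   # (T, L)
        sampled = feature_map[:, ys, xs]            # (C, T, L)
        return sampled.mean(dim=2).transpose(0, 1)  # (T, C)

    # Example: fuse conv5-sized maps from both streams, then pool along
    # 10 trajectories of length 15 to get the spatial-temporal channel.
    fusion = SpatialTemporalFusion(channels=512)
    fused = fusion(torch.randn(1, 512, 14, 14), torch.randn(1, 512, 14, 14))
    tracks = torch.randint(0, 14, (10, 15, 2))
    descriptors = trajectory_pool(fused[0], tracks)  # shape (10, 512)

A 1x1 convolution after channel-wise stacking is one common way to realize conv fusion; it lets the network learn how corresponding positions in the two streams should be weighted, which matches the abstract's goal of enforcing a consistent spatial-temporal structure.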

