Fusing Augmented Spatio-temporal Features for Action Recognition

¹ School of Computer Science and Technology, Soochow University
² Collaborative Innovation Center of Novel Software Technology and Industrialization
³ Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
⁴ School of Computer Science and Engineering, Changshu Institute of Technology

Published: 30 December 2016 by MDPI in MOL2NET'16, Conference on Molecular, Biomed., Comput. & Network Science and Engineering, 2nd ed. congress USEDAT-02: USA-Europe Data Analysis Training Program Workshop, Cambridge, UK-Bilbao, Spain-Miami, USA, 2016

https://doi.org/10.3390/mol2net-02-03852

Abstract:

Visual features are vitally important for action recognition in videos. However, traditional features fail to recognize actions effectively for two reasons: on the one hand, spatial features are not powerful enough to capture the appearance information of complex video actions; on the other hand, important temporal details are always ignored during pooling and encoding. In this paper, we present a new architecture that fuses multiple augmented spatio-temporal features. To strengthen the spatial features, we apply cropping and horizontal flipping to the original frame images, and then feed the processed images into a deep Two-Stream network to produce robust spatial representations. To obtain powerful temporal features, we employ a Fourier temporal pyramid (FTP) to capture three different levels of video context: the short-term level, the medium-range level, and the global-range level. Finally, we fuse these augmented spatio-temporal features using canonical correlation analysis (CCA), which is capable of capturing the correlation between these features. Experimental results on the UCF101 dataset show that our method achieves excellent performance for action recognition.
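As an illustration of the Fourier temporal pyramid idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes per-frame descriptors are already available as a (T, D) array, that the three context levels correspond to splitting the video into 1, 2, and 4 equal segments (global, medium-range, short-term), and that only the magnitudes of the lowest-frequency coefficients are kept. The function name `fourier_temporal_pyramid` and the parameters `levels` and `n_coeffs` are hypothetical choices made for this example.

```python
import numpy as np


def fourier_temporal_pyramid(features, levels=3, n_coeffs=4):
    """Sketch of an FTP descriptor for a (T, D) array of per-frame features.

    Assumption: level k splits the time axis into 2**k segments, so
    levels=3 gives global (1), medium-range (2) and short-term (4) context.
    """
    features = np.asarray(features, dtype=np.float64)
    if features.ndim != 2:
        raise ValueError("features must have shape (T, D)")
    if features.shape[0] < 2 ** (levels - 1):
        raise ValueError("too few frames for the requested number of levels")

    parts = []
    for level in range(levels):
        # np.array_split tolerates T not being divisible by the segment count.
        for segment in np.array_split(features, 2 ** level, axis=0):
            # FFT along time; zero-pad short segments so that every segment
            # contributes exactly n_coeffs low-frequency magnitudes per dimension.
            n_fft = max(segment.shape[0], n_coeffs)
            spectrum = np.abs(np.fft.fft(segment, n=n_fft, axis=0))
            parts.append(spectrum[:n_coeffs].ravel())
    return np.concatenate(parts)


# Usage: 32 frames of 8-dimensional (random, illustrative) features.
rng = np.random.default_rng(0)
descriptor = fourier_temporal_pyramid(rng.normal(size=(32, 8)))
print(descriptor.shape)  # (1 + 2 + 4) segments * 4 coeffs * 8 dims = (224,)
```

Keeping only low-frequency magnitudes makes each segment's descriptor a fixed length regardless of how many frames it covers, which is what allows the three levels to be concatenated into a single vector before fusion.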

Keywords: action recognition; CNN features; Fourier temporal pyramid; CCA fusion
