Visual features are vitally important for action recognition in videos. However, traditional features fail to recognize actions effectively for two reasons: first, spatial features are not powerful enough to capture the appearance information of complex video actions; second, important temporal details are lost during pooling and encoding. In this paper, we present a new architecture that fuses multiple augmented spatio-temporal features. To strengthen the spatial features, we apply cropping and horizontal flipping to the original frame images, and then feed the processed images into a deep two-stream network to produce robust spatial representations. To obtain powerful temporal features, we employ a Fourier temporal pyramid (FTP) to capture three levels of video context: short-term, medium-range, and global-range. Finally, we fuse these augmented spatio-temporal features using canonical correlation analysis (CCA), which is capable of capturing the correlation between them. Experimental results on the UCF101 dataset show that our method achieves excellent performance for action recognition.
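As a rough illustration of the temporal step, the sketch below computes a Fourier temporal pyramid over a sequence of per-frame feature vectors. It is not the authors' implementation: the function name `fourier_temporal_pyramid`, the 1/2/4 segment split per level, the number of retained low-frequency coefficients, and the zero-padding of short segments are all assumptions made here for illustration. Level 0 covers the whole video (global-range), while deeper levels cover shorter segments (medium-range and short-term).

```python
import numpy as np


def fourier_temporal_pyramid(features, levels=3, num_coeffs=4):
    """Hypothetical FTP sketch: features is a (T, D) array of per-frame
    descriptors; returns a 1-D vector of low-frequency magnitudes."""
    features = np.asarray(features, dtype=np.float64)
    if features.ndim != 2:
        raise ValueError("features must have shape (T, D)")
    if features.shape[0] < 2 ** (levels - 1):
        raise ValueError("too few frames for the requested pyramid depth")

    parts = []
    for level in range(levels):
        # Level 0: one segment (whole video); level 1: two; level 2: four.
        for segment in np.array_split(features, 2 ** level, axis=0):
            # Zero-pad short segments so num_coeffs coefficients always exist
            # (an assumption of this sketch, not taken from the paper).
            n = max(segment.shape[0], 2 * (num_coeffs - 1))
            spectrum = np.fft.rfft(segment, n=n, axis=0)
            # Keep magnitudes of the lowest frequencies for every dimension.
            parts.append(np.abs(spectrum[:num_coeffs]).ravel())
    return np.concatenate(parts)


# Example: 32 frames of 8-dimensional features (random stand-in data).
video_features = np.random.rand(32, 8)
descriptor = fourier_temporal_pyramid(video_features)
print(descriptor.shape)
```

With the default settings the descriptor has (1 + 2 + 4) × 4 × D entries, so its length does not depend on the number of frames. Descriptors of this kind could then be paired with the spatial features and passed to a CCA routine for fusion.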
Fusing Augmented Spatio-temporal Features for Action Recognition
Published: 30 December 2016 by MDPI in MOL2NET'16, Conference on Molecular, Biomed., Comput. & Network Science and Engineering, 2nd ed. congress USEDAT-02: USA-Europe Data Analysis Training Program Workshop, Cambridge, UK-Bilbao, Spain-Miami, USA, 2016
Keywords: action recognition; CNN features; Fourier temporal pyramid; CCA fusion