Video Description with Spatio-temporal Feature and Knowledge Transferring
Published: 30 December 2016 by MDPI in MOL2NET'16, Conference on Molecular, Biomed., Comput. & Network Science and Engineering, 2nd ed., congress USEDAT-02: USA-Europe Data Analysis Training Program Workshop, Cambridge, UK-Bilbao, Spain-Miami, USA, 2016
Abstract: Describing open-domain video in natural language is a major challenge for computer vision. In this paper, we investigate how to exploit temporal information and learn linguistic knowledge for video description. Conventional convolutional neural networks (CNNs) learn powerful spatial features from video frames but ignore the underlying temporal dynamics. To address this, we extract SIFT flow features to capture temporal information. Moreover, the sequence generators in recent work are trained solely on text from video description datasets, so the generated sentences tend to show linguistic irregularities associated with a restricted language model and a small vocabulary. We therefore transfer knowledge from large text corpora and employ word2vec as the word representation. Experimental results demonstrate that our model outperforms related work.
Keywords: video description, SIFT flow, knowledge transferring, word2vec
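As a rough illustration of the temporal stream described in the abstract, the sketch below pools a dense flow field between sampled frames into a per-frame motion descriptor. SIFT flow has no standard OpenCV implementation and the paper's exact extraction pipeline is not given here, so Farneback dense optical flow stands in for it; `video_path` and `frame_step` are hypothetical parameters, not names from the paper.

```python
# Illustrative stand-in for SIFT flow temporal features:
# Farneback dense optical flow between sampled frames, pooled
# into a compact per-frame motion descriptor.
import cv2
import numpy as np

def extract_temporal_features(video_path, frame_step=5):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return np.empty((0, 2))
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    features = []
    while True:
        # Skip ahead frame_step frames to reduce redundancy between samples.
        for _ in range(frame_step):
            ok, frame = cap.read()
            if not ok:
                cap.release()
                return np.array(features)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense flow field: one (dx, dy) vector per pixel.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # Pool the dense field into a global (mean dx, mean dy) motion
        # vector for this sampled frame; a real pipeline would use a
        # richer pooling (e.g., orientation histograms).
        features.append(flow.reshape(-1, 2).mean(axis=0))
        prev_gray = gray
```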
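The knowledge-transfer step can be pictured as initializing the caption generator's word embeddings from vectors pretrained on a large external corpus. This is a minimal sketch assuming the publicly released GoogleNews word2vec vectors and the gensim library; the toy vocabulary and the random fallback for out-of-vocabulary words are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: seed the generator's word embeddings with word2vec
# vectors pretrained on a large text corpus.
import numpy as np
from gensim.models import KeyedVectors

# Pretrained vectors learned outside the small video-description corpus.
w2v = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

vocab = ['a', 'man', 'is', 'playing', 'guitar']  # toy caption vocabulary
dim = w2v.vector_size  # 300 for the GoogleNews model

# Words found in word2vec inherit semantics from the large corpus;
# unseen words fall back to a small random initialization.
embedding = np.zeros((len(vocab), dim), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in w2v:
        embedding[i] = w2v[word]
    else:
        embedding[i] = np.random.uniform(-0.05, 0.05, dim)
```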