Video Description with Spatio-temporal Feature and Knowledge Transferring
1  School of Computer Science and Technology, Soochow University
2  Collaborative Innovation Center of Novel Software Technology and Industrialization
3  Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education


Describing open-domain videos with natural language is a major challenge for computer vision. In this paper, we investigate how to exploit temporal information and learn linguistic knowledge for video description. Traditional convolutional neural networks (CNNs) learn powerful spatial features from videos but ignore the underlying temporal dynamics. To address this problem, we extract SIFT flow features to capture temporal information. The sequence generators in recent work are trained solely on text from video description datasets, so the generated sentences tend to exhibit linguistic irregularities caused by a restricted language model and a small vocabulary. To remedy this, we transfer knowledge from large text corpora and employ word2vec as the word representation. Experimental results demonstrate that our model outperforms related work.
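The knowledge-transfer idea above, replacing a small in-domain vocabulary with dense word2vec embeddings learned from large corpora, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `pretrained` table, its 3-dimensional vectors, and the mean-vector fallback for out-of-vocabulary words are all illustrative assumptions.

```python
import numpy as np

# Hypothetical pretrained word2vec vectors; in practice these would be
# loaded from a model trained on a large external text corpus.
# Dimensions and values here are illustrative only.
pretrained = {
    "a":   np.array([0.1, 0.2, 0.3]),
    "man": np.array([0.4, 0.1, 0.5]),
    "is":  np.array([0.2, 0.2, 0.2]),
}

def embed(sentence, table):
    """Map each word in the sentence to its dense word2vec vector.

    Out-of-vocabulary words fall back to the mean of the known vectors,
    a simple common strategy when transferring embeddings (an assumption
    here, not necessarily the paper's choice)."""
    unk = np.mean(list(table.values()), axis=0)
    return np.stack([table.get(w, unk) for w in sentence.split()])

vecs = embed("a man is cooking", pretrained)
print(vecs.shape)  # (4, 3): four words, each a 3-dim embedding
```

The sequence generator would then consume these dense vectors in place of one-hot indices, so words unseen in the small video-description vocabulary still receive meaningful representations.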

Keywords: video description, SIFT flow, knowledge transferring, word2vec