Video Description with Spatio-temporal Feature and Knowledge Transferring
Published: 30 December 2016 by MDPI in MOL2NET'16, Conference on Molecular, Biomed., Comput. & Network Science and Engineering, 2nd ed., congress USEDAT-02: USA-Europe Data Analysis Training Program Workshop, Cambridge, UK-Bilbao, Spain-Miami, USA, 2016
Abstract: Describing open-domain video in natural language is a major challenge for computer vision. In this paper, we investigate how to exploit temporal information and learn linguistic knowledge for video description. Conventional convolutional neural networks (CNNs) learn powerful spatial features from video frames but ignore the underlying temporal dynamics. To address this, we extract SIFT flow features to capture temporal information. Moreover, the sequence generators in recent work are trained solely on text from video description datasets, so the generated sentences tend to show linguistic irregularities associated with a restricted language model and a small vocabulary. We therefore transfer knowledge from large text corpora and employ word2vec as the word representation. Experimental results demonstrate that our model outperforms related work.
Keywords: video description, SIFT flow, knowledge transferring, word2vec
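As a rough illustration of the temporal stream described in the abstract, the sketch below pools a dense flow field between sampled frames into a per-frame motion descriptor. SIFT flow has no standard OpenCV implementation and the paper's exact extraction pipeline is not given here, so Farneback dense optical flow stands in for it; `video_path` and `frame_step` are hypothetical parameters, not names from the paper.

```python
# Illustrative stand-in for SIFT flow temporal features:
# Farneback dense optical flow between sampled frames, pooled
# into a compact per-frame motion descriptor.
import cv2
import numpy as np

def extract_temporal_features(video_path, frame_step=5):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return np.empty((0, 2))
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    features = []
    while True:
        # Skip ahead frame_step frames to reduce redundancy between samples.
        for _ in range(frame_step):
            ok, frame = cap.read()
            if not ok:
                cap.release()
                return np.array(features)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense flow field: one (dx, dy) vector per pixel.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # Pool the dense field into a global (mean dx, mean dy) motion
        # vector for this sampled frame; a real pipeline would use a
        # richer pooling (e.g., orientation histograms).
        features.append(flow.reshape(-1, 2).mean(axis=0))
        prev_gray = gray
```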
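The knowledge-transfer step can be pictured as initializing the caption generator's word embeddings from vectors pretrained on a large external corpus. This is a minimal sketch assuming the publicly released GoogleNews word2vec vectors and the gensim library; the toy vocabulary and the random fallback for out-of-vocabulary words are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: seed the generator's word embeddings with word2vec
# vectors pretrained on a large text corpus.
import numpy as np
from gensim.models import KeyedVectors

# Pretrained vectors learned outside the small video-description corpus.
w2v = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

vocab = ['a', 'man', 'is', 'playing', 'guitar']  # toy caption vocabulary
dim = w2v.vector_size  # 300 for the GoogleNews model

# Words found in word2vec inherit semantics from the large corpus;
# unseen words fall back to a small random initialization.
embedding = np.zeros((len(vocab), dim), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in w2v:
        embedding[i] = w2v[word]
    else:
        embedding[i] = np.random.uniform(-0.05, 0.05, dim)
```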