Video Description with Spatio-temporal Feature and Knowledge Transferring

Published: 30 December 2016 by MDPI
in MOL2NET'16, Conference on Molecular, Biomed., Comput. & Network Science and Engineering, 2nd ed.
congress USEDAT-02: USA-Europe Data Analysis Training Program Workshop, Cambridge, UK-Bilbao, Spain-Miami, USA, 2016

Abstract: Describing open-domain video in natural language is a major challenge for computer vision. In this paper, we investigate how to exploit temporal information and learn linguistic knowledge for video description. Conventional convolutional neural networks (CNNs) learn powerful spatial features from video frames but ignore the underlying temporal structure. To address this, we extract SIFT flow features to capture temporal information. Moreover, the sequence generators of recent work are trained solely on text from video description datasets, so the generated sentences tend to exhibit the linguistic irregularities of a restricted language model and a small vocabulary. To overcome this, we transfer knowledge from large text corpora and employ word2vec as the word representation. Experimental results demonstrate that our model outperforms related work.

Keywords: video description, SIFT flow, knowledge transferring, word2vec
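
The temporal step described in the abstract can be illustrated with a short sketch of dense per-frame motion features. SIFT flow itself has no OpenCV implementation, so Farneback dense optical flow stands in here as an illustrative substitute; the frame list and histogram pooling are assumptions, not the authors' method.

# Minimal sketch: dense motion features per consecutive frame pair,
# in the spirit of the paper's SIFT flow step. Farneback optical flow
# is a stand-in; SIFT flow is not available in OpenCV.
import cv2
import numpy as np

def temporal_features(frames):
    """Pool a dense flow field into a small motion descriptor per frame pair."""
    feats = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense flow: one (dx, dy) displacement per pixel.
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        # Magnitude-weighted orientation histogram as a compact descriptor.
        hist, _ = np.histogram(ang, bins=8, range=(0, 2 * np.pi), weights=mag)
        feats.append(hist / (hist.sum() + 1e-8))
        prev = curr
    return np.stack(feats)

One natural design, hedged here as an assumption, is to concatenate each pooled motion descriptor with the CNN's spatial features for the same frame before sequence generation.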
                    
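The knowledge-transfer idea can be sketched similarly. Below is a minimal example, assuming gensim and PyTorch, of initializing a caption generator's embedding layer from word2vec vectors pretrained on a large external corpus; the vector file, toy vocabulary, and surrounding model are illustrative assumptions rather than the authors' released code.

# Minimal sketch: transfer pretrained word2vec vectors into the word
# embeddings of a sentence decoder. File name and vocabulary are assumed.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

# Vectors pretrained on a large external corpus (example file name).
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

vocab = ["a", "man", "is", "playing", "guitar"]  # toy caption vocabulary
dim = w2v.vector_size
weights = np.random.uniform(-0.05, 0.05, (len(vocab), dim)).astype(np.float32)
for i, word in enumerate(vocab):
    if word in w2v:                    # transfer the pretrained vector if known
        weights[i] = w2v[word]

# The decoder's embedding layer starts from transferred weights and can be
# fine-tuned on the (small) video description dataset.
embedding = nn.Embedding.from_pretrained(torch.from_numpy(weights),
                                         freeze=False)

Starting from transferred embeddings and fine-tuning on the small caption corpus is one way to offset the restricted language model and vocabulary the abstract mentions.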