Article

Generating Video Descriptions With Latent Topic Guidance

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 21, Issue 9, Pages 2407-2418

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TMM.2019.2896515

Keywords

Video Captioning; Latent Topic; Multimodal Ensemble

Funding

  1. National Natural Science Foundation of China [61772535]
  2. National Key Research and Development Plan [2016YFB1001202]
  3. Research Foundation of Beijing Municipal Science & Technology Commission [Z181100008918002]

Abstract

Automatic video description generation (a.k.a. video captioning) is one of the ultimate goals of video understanding. Despite its wide range of applications, such as video indexing and retrieval, video captioning remains challenging due to the complexity and diversity of video content. First, open-domain videos cover a broad range of topics, which leads to highly variable vocabularies and expression styles for describing video content. Second, videos naturally contain multiple modalities, including image, motion, and acoustic media, and the information each modality provides varies with conditions. In this paper, we propose a novel topic-guided video captioning model to address these challenges. Our model consists of two joint tasks: latent topic generation and topic-guided caption generation. The topic generation task aims to automatically predict the latent topic of a video. Since no ground-truth topic information is available, we mine multimodal topics in an unsupervised fashion from video contents and annotated captions, and then distill the topic distribution into a topic prediction model. In the topic-guided generation task, we employ the topic guidance for two purposes. The first is to reduce language complexity across topics: we propose a topic-aware decoder that leverages the latent topics to induce topic-related language models. The decoder is generic and can be integrated with a temporal attention mechanism. The second is to dynamically attend to important modalities by topic: we propose a flexible topic-guided multimodal ensemble framework and use a topic gating network to determine the attention weights. The two tasks are correlated and collaborate to generate more detailed and accurate video captions. Extensive experiments on two public benchmark datasets, MSR-VTT and Youtube2Text, demonstrate the effectiveness of the proposed topic-guided video captioning system, which achieves state-of-the-art performance on both datasets. A sketch of the topic prediction and topic-gated ensemble mechanisms follows below.
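
The following is a minimal sketch of the two topic-guided mechanisms the abstract describes: distilling a mined topic distribution into a topic predictor, and gating per-modality word distributions by the predicted topics. It assumes pooled video features, a fixed number of mined latent topics, and one caption decoder per modality; all class, function, and parameter names (TopicPredictor, TopicGatedEnsemble, distillation_loss, the feature and vocabulary sizes) are hypothetical illustrations, not the authors' code.

```python
# Illustrative sketch only; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopicPredictor(nn.Module):
    """Predicts latent-topic logits from pooled video features."""

    def __init__(self, feat_dim: int, num_topics: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_topics)

    def forward(self, video_feat: torch.Tensor) -> torch.Tensor:
        # video_feat: (batch, feat_dim) -> topic logits: (batch, num_topics)
        return self.fc(video_feat)


def distillation_loss(topic_logits: torch.Tensor,
                      mined_topics: torch.Tensor) -> torch.Tensor:
    # KL divergence between the unsupervised mined topic distribution
    # (soft targets) and the predicted one; F.kl_div expects the input
    # as log-probabilities and the target as probabilities.
    return F.kl_div(F.log_softmax(topic_logits, dim=-1),
                    mined_topics, reduction="batchmean")


class TopicGatedEnsemble(nn.Module):
    """Mixes per-modality word distributions with topic-dependent weights."""

    def __init__(self, num_topics: int, num_modalities: int):
        super().__init__()
        # Topic gating network: topic distribution -> modality attention.
        self.gate = nn.Linear(num_topics, num_modalities)

    def forward(self, word_logits: torch.Tensor,
                topic_dist: torch.Tensor) -> torch.Tensor:
        # word_logits: (batch, num_modalities, vocab), one slice per
        # modality-specific decoder (e.g., image, motion, audio);
        # topic_dist: (batch, num_topics).
        weights = F.softmax(self.gate(topic_dist), dim=-1)  # (batch, M)
        probs = F.softmax(word_logits, dim=-1)              # (batch, M, V)
        # Weighted mixture over modalities -> final word distribution.
        return (weights.unsqueeze(-1) * probs).sum(dim=1)   # (batch, V)


# Usage with made-up sizes:
pred = TopicPredictor(feat_dim=2048, num_topics=20)
ensemble = TopicGatedEnsemble(num_topics=20, num_modalities=3)
feats = torch.randn(4, 2048)
logits = torch.randn(4, 3, 10000)  # per-modality word logits at one step
word_probs = ensemble(logits, F.softmax(pred(feats), dim=-1))
```

Note the design choice this sketch reflects: because the gate conditions only on the topic distribution, the model can learn, per topic, which modality to trust (e.g., weighting audio more for music-related topics) without retraining the individual decoders.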
