期刊
IEEE TRANSACTIONS ON MULTIMEDIA
卷 21, 期 9, 页码 2407-2418出版社
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TMM.2019.2896515
关键词
Video Captioning; Latent Topic; Multimodal Ensemble
资金
- National Natural Science Foundation of China [61772535]
- National Key Research and Development Plan [2016YFB1001202]
- Research Foundation of Beijing Municipal Science & Technology Commission [Z181100008918002]
Automatic video description generation (a.k.a video captioning) is one of the ultimate goals for video understanding. Despite the wide range of applications such as video indexing and retrieval etc., the video captioning task remains quite challenging due to the complexity and diversity of video content. First, open-domain videos cover a broad range of topics, which results in highly variable vocabularies and expression styles to describe the video contents. Second, videos naturally contain multiple modalities including image, motion, and acoustic media. The information provided by different modalities differs in different conditions. In this paper, we propose a novel topic-guided video captioning model to address the above-mentioned challenges in video captioning. Our model consists of two joint tasks, namely, latent topic generation and topic-guided caption generation. The topic generation task aims to automatically predict the latent topic of the video. Since there is no groundtruth topic information, we mine multimodal topics in an unsupervised fashion based on video contents and annotated captions, and then distill the topic distribution to a topic prediction model. In the topic-guided generation task, we employ the topic guidance for two purposes. The first is to narrow down the language complexity across topics, where we propose the topic-aware decoder to leverage the latent topics to induce topic-related language models. The decoder is also generic and can be integrated with a temporal attention mechanism. The second is to dynamically attend to important modalities by topics, where we propose a flexible topic-guided multimodal ensemble framework and use the topic gating network to determine the attention weights. The two tasks are correlated with each other, and they collaborate to generate more detailed and accurate video captions. Our extensive experiments on two public benchmark datasets MSR-VTT and Youtube2Text demonstrate the effectiveness of the proposed topic-guided video captioning system, which achieves state-of-the-art performance on both datasets.
作者
我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。
推荐
暂无数据