☆ 4.7 Article

Deep Hierarchical Encoder-Decoder Network for Image Captioning

IEEE TRANSACTIONS ON MULTIMEDIA (2019)

Journal

IEEE TRANSACTIONS ON MULTIMEDIA

Volume 21, Issue 11, Pages 2942-2956

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TMM.2019.2915033

Keywords

Visualization; Semantics; Hidden Markov models; Decoding; Logic gates; Training; Computer architecture; Deep hierarchical structure; encoder-decoder; LSTM; image captioning; retrieval; vision-sentence

Categories

Computer Science, Information Systems; Computer Science, Software Engineering; Telecommunications

Funding

National Natural Science Foundation of China [91646207, 61773377, 61573352]
Beijing Natural Science Foundation [L172053]

Abstract

Encoder-decoder models are widely used in image captioning, and most of them are built on a single long short-term memory (LSTM) layer. The capacity of such a single-layer network, in which the encoder and decoder are integrated, is limited for a task as complex as image captioning. Moreover, how to effectively increase the vertical depth of the encoder-decoder remains an open problem. To address these problems, a novel deep hierarchical encoder-decoder network is proposed for image captioning, in which a deep hierarchical structure separates the functions of the encoder and the decoder. The model can efficiently exploit the representation capacity of deep networks to fuse high-level semantics of vision and language when generating captions. Specifically, visual representations at the top levels of abstraction are considered simultaneously, and each of these levels is associated with one LSTM. The bottom-most LSTM serves as the encoder of textual inputs. The middle layer enhances the decoding ability of the top-most LSTM. Furthermore, by introducing a semantic enhancement module for image features and a distribution combine module for text features, variant architectures of our model are constructed to explore the impacts of, and mutual interactions among, the visual representations, the textual representations, and the output of the middle LSTM layer. In particular, the framework is trained with a reinforcement learning method that uses policy gradient optimization to address the exposure bias between training and testing. Qualitative analyses illustrate how our model translates an image into a sentence, and further visualization presents the evolution over time of the hidden states of the different hierarchical LSTMs. Extensive experiments demonstrate that our model outperforms current state-of-the-art models on three benchmark datasets: Flickr8K, Flickr30K, and MSCOCO.
Our method achieves the best results on both image captioning and retrieval tasks, and it also achieves superior performance on the MSCOCO captioning leaderboard.
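
The abstract mentions policy gradient training as a remedy for exposure bias, that is, the mismatch between training on ground-truth word prefixes and testing on the model's own predictions. The abstract does not say which policy gradient variant is used, so the following is only a minimal sketch of a generic REINFORCE-style loss with a baseline, written in plain Python. The function names, the sentence-level `reward`, and the `baseline` are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_loss(step_logits, sampled_ids, reward, baseline):
    """REINFORCE-style surrogate loss for one sampled caption.

    step_logits: one list of vocabulary logits per time step (hypothetical
                 decoder outputs).
    sampled_ids: word indices sampled from the model itself, so training
                 sees the same kind of prefixes as testing.
    reward, baseline: sentence-level scores, e.g. a caption metric
                      (assumed here, not taken from the paper).
    """
    log_prob = 0.0
    for logits, word_id in zip(step_logits, sampled_ids):
        log_prob += math.log(softmax(logits)[word_id])
    advantage = reward - baseline
    return -advantage * log_prob


def grad_wrt_logits(step_logits, sampled_ids, reward, baseline):
    """Gradient of the loss above with respect to each logit.

    For a softmax policy, d(-A * log p_w) / d logit_i = A * (p_i - 1[i == w]).
    """
    advantage = reward - baseline
    grads = []
    for logits, word_id in zip(step_logits, sampled_ids):
        probs = softmax(logits)
        grads.append([
            advantage * (p - (1.0 if i == word_id else 0.0))
            for i, p in enumerate(probs)
        ])
    return grads
```

When the sampled caption scores above the baseline, the gradient raises the logits of the sampled words; when it scores below, it lowers them. A sample that exactly matches the baseline contributes no gradient.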
