☆ 4.7 Article

Region-Aware Image Captioning via Interaction Learning

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Journal

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Volume 32, Issue 6, Pages 3685-3696

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TCSVT.2021.3107035

Keywords

Visualization; Semantics; Task analysis; Proposals; Learning systems; Sports; Feature extraction; Region modeling; interaction learning; image captioning

Funding

National Natural Science Foundation of China [61772359, 62002257]
Grant of Tianjin New Generation Artificial Intelligence Major Program [19ZXZNGX00110, 18ZXZNGX00150]
China Postdoctoral Science Foundation [2021M692395]

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

Image captioning, one of the primary goals in computer vision, aims to automatically generate natural descriptions for images. This paper proposes a region-aware interaction learning method to explicitly capture the semantic correlations between regions and objects for word inference, effectively capturing contextual information.

Image captioning is one of the primary goals in computer vision which aims to automatically generate natural descriptions for images. Intuitively, human visual system can notice some stimulating regions at first glance, and then volitionally focus on interesting objects within the region. For example, to generate a free-form sentence about boy-catch-baseball, the visual region involving boy and baseball could be first attended and then guide the salient object discovery for the word-by-word generation. Till now, previous captioning works mainly rely on the object-wise modeling and ignore the rich regional patterns. To mitigate the drawback, this paper proposes the region-aware interaction learning method, which aims to explicitly capture the semantic correlations in the region and object dimensions for the word inference. First, given an image, we extract a set of regions which contain diverse objects and their relations. Second, we present the spatial-GCN interaction refining structure which can establish the connection between regions and objects to effectively capture contextual information. Third, we design the dual-attention interaction inference procedure, which enables attention to be calculated in region and object dimensions jointly for the word generation. Specifically, the guidance mechanism is proposed to selectively emphasize semantic inter-dependencies from region to object attentions. Extensive experiments on the MSCOCO dataset demonstrate the superiority of the proposed method. Additional ablation studies and visualization further validate its effectiveness.

Region-Aware Image Captioning via Interaction Learning

Journal

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Region-Aware Image Captioning via Interaction Learning

Journal

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper