4.7 Article

Region-Aware Image Captioning via Interaction Learning

Journal

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCSVT.2021.3107035

Keywords

Visualization; Semantics; Task analysis; Proposals; Learning systems; Sports; Feature extraction; Region modeling; interaction learning; image captioning

Funding

  1. National Natural Science Foundation of China [61772359, 62002257]
  2. Grant of Tianjin New Generation Artificial Intelligence Major Program [19ZXZNGX00110, 18ZXZNGX00150]
  3. China Postdoctoral Science Foundation [2021M692395]


Image captioning, one of the primary goals in computer vision, aims to automatically generate natural descriptions for images. This paper proposes a region-aware interaction learning method to explicitly capture the semantic correlations between regions and objects for word inference, effectively capturing contextual information.
Image captioning, one of the primary goals in computer vision, aims to automatically generate natural-language descriptions for images. Intuitively, the human visual system notices stimulating regions at first glance and then deliberately focuses on interesting objects within those regions. For example, to generate a free-form sentence about boy-catch-baseball, the visual region containing the boy and the baseball could be attended first and then guide the discovery of salient objects for word-by-word generation. Previous captioning works rely mainly on object-wise modeling and ignore rich regional patterns. To mitigate this drawback, this paper proposes a region-aware interaction learning method, which aims to explicitly capture semantic correlations in both the region and object dimensions for word inference. First, given an image, we extract a set of regions that contain diverse objects and their relations. Second, we present a spatial-GCN interaction refining structure that establishes connections between regions and objects to effectively capture contextual information. Third, we design a dual-attention interaction inference procedure, which enables attention to be computed jointly in the region and object dimensions for word generation. Specifically, a guidance mechanism is proposed to selectively emphasize semantic inter-dependencies from region attention to object attention. Extensive experiments on the MSCOCO dataset show that the proposed method outperforms the compared approaches. Additional ablation studies and visualizations further validate its effectiveness.
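The abstract does not give formulas for the dual-attention step, so the following is only a minimal NumPy sketch of the general idea of region attention guiding object attention. It is not the authors' implementation. The function name `dual_attention`, the binary `membership` matrix (which objects fall inside which region), the scaled dot-product scoring, and the log-bias form of the guidance are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


def dual_attention(query, regions, objects, membership):
    """Illustrative region-guided object attention (hypothetical design).

    query:      (d,)   decoder state for the current word
    regions:    (R, d) region features
    objects:    (O, d) object features
    membership: (R, O) 1.0 if object o lies inside region r, else 0.0
                (an assumed input; the paper's linkage may differ)
    """
    d = query.shape[0]

    # Attention over regions from scaled dot-product scores.
    region_attn = softmax(regions @ query / np.sqrt(d))      # (R,)

    # Guidance: each object inherits the attention mass of the
    # regions that contain it.
    guidance = region_attn @ membership                       # (O,)

    # Object attention, biased toward objects in attended regions.
    object_scores = objects @ query / np.sqrt(d)              # (O,)
    object_attn = softmax(object_scores + np.log(guidance + 1e-9))

    # Joint context: concatenation of the two attended feature vectors.
    context = np.concatenate([region_attn @ regions, object_attn @ objects])
    return region_attn, object_attn, context
```

In this sketch an object that belongs to no region receives almost no attention, and an object that appears in several attended regions is weighted more heavily, which is one simple way to read "guidance from region to object attention".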

