Article

SPCA-Net: a based on spatial position relationship co-attention network for visual question answering

Journal

VISUAL COMPUTER
Volume 38, Issue 9-10, Pages 3097-3108

Publisher

SPRINGER
DOI: 10.1007/s00371-022-02524-z

Keywords

BERT; Guided-attention; Self-attention; Faster R-CNN; Spatial position relationship

Funding

  1. National Natural Science Foundation of China [U1911401]
  2. Ministry of Science and Technology of China [ZDI135-96]

This paper proposes an effective deep co-attention network that addresses the issue that VQA models do not consider the spatial relationships among image region features. By introducing BERT and spatial position relationships, the model enables fine-grained interaction between the question and the image.
Recently, state-of-the-art VQA (visual question answering) methods have relied mainly on co-attention to link each visual object with the text objects, which achieves only a coarse interaction between the two modalities. Moreover, VQA models tend to focus on the association between visual and language features without considering the spatial relationships among the image region features extracted by Faster R-CNN. This paper proposes an effective deep co-attention network to address this problem. First, BERT is introduced to better capture the relationships between words and to make the extracted text features more robust; second, a multimodal co-attention module based on spatial position relationships is proposed to realize fine-grained interactions between the question and the image. It consists of three basic components: a text self-attention unit, an image self-attention unit, and a question-guided-attention unit. After computing attention, the self-attention mechanism over image visual features integrates the spatial position and width/height of each image region, so that every region is aware of the relative location and size of the other regions. Our experimental results indicate that our model significantly outperforms other existing models.
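The spatial-position-aware image self-attention described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch implementation (not the authors' released code) of a single image self-attention unit that adds a pairwise geometry bias derived from the Faster R-CNN bounding boxes (relative center offsets and width/height ratios); the exact geometry fusion used in SPCA-Net may differ, and all module and parameter names are illustrative.

```python
# Minimal sketch (assumed, not the authors' code): image self-attention with a
# bounding-box geometry bias, given region features v of shape (B, N, dim) and
# boxes (x1, y1, x2, y2) of shape (B, N, 4) from Faster R-CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Maps pairwise geometry features (dx, dy, dw, dh) to a per-head additive
        # attention bias, so each region "sees" the relative location and size
        # of the other regions.
        self.geo_bias = nn.Sequential(nn.Linear(4, num_heads), nn.ReLU())

    @staticmethod
    def pairwise_geometry(boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (B, N, 4) as (x1, y1, x2, y2) -> pairwise log-ratios, (B, N, N, 4)
        cx = (boxes[..., 0] + boxes[..., 2]) / 2
        cy = (boxes[..., 1] + boxes[..., 3]) / 2
        w = (boxes[..., 2] - boxes[..., 0]).clamp(min=1e-3)
        h = (boxes[..., 3] - boxes[..., 1]).clamp(min=1e-3)
        dx = torch.log((cx.unsqueeze(2) - cx.unsqueeze(1)).abs().clamp(min=1e-3) / w.unsqueeze(2))
        dy = torch.log((cy.unsqueeze(2) - cy.unsqueeze(1)).abs().clamp(min=1e-3) / h.unsqueeze(2))
        dw = torch.log(w.unsqueeze(2) / w.unsqueeze(1))
        dh = torch.log(h.unsqueeze(2) / h.unsqueeze(1))
        return torch.stack([dx, dy, dw, dh], dim=-1)

    def forward(self, v: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        B, N, _ = v.shape
        def split(x):  # (B, N, dim) -> (B, heads, N, head_dim)
            return x.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, val = split(self.q_proj(v)), split(self.k_proj(v)), split(self.v_proj(v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (B, heads, N, N)
        geo = self.geo_bias(self.pairwise_geometry(boxes))        # (B, N, N, heads)
        scores = scores + geo.permute(0, 3, 1, 2)                 # add geometry bias
        attn = F.softmax(scores, dim=-1)
        out = (attn @ val).transpose(1, 2).reshape(B, N, -1)
        return self.out_proj(out)

# Example usage (hypothetical shapes: 2 images, 36 regions, 512-d features):
# feats = torch.randn(2, 36, 512)
# xy = torch.rand(2, 36, 2) * 500
# wh = torch.rand(2, 36, 2) * 100 + 1
# boxes = torch.cat([xy, xy + wh], dim=-1)
# out = SpatialSelfAttention(512)(feats, boxes)   # (2, 36, 512)
```

A question-guided-attention unit would follow the same pattern, with the query projected from the BERT question features instead of the region features; this sketch only illustrates the geometry-aware image self-attention idea.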
