Article

Cross-modality co-attention networks for visual question answering

Journal

SOFT COMPUTING
Volume 25, Issue 7, Pages 5411-5421

Publisher

SPRINGER
DOI: 10.1007/s00500-020-05539-7

Keywords

Visual question answering; Cross-modality co-attention; Computer vision

Funding

  1. National Natural Science Foundation of China [61672338, 61873160]

Abstract

Visual question answering (VQA) is an emerging task that combines natural language processing and computer vision. Selecting compelling multi-modality features is at the core of VQA. In multi-modal learning, attention networks provide an effective way to selectively utilize the given visual information. However, most previous VQA models focus only on the relationship between visual and language features and ignore the internal relationships within each modality. To address this issue: (1) we propose a cross-modality co-attention networks (CMCN) framework that learns both intra-modality and cross-modality relationships. (2) The cross-modality co-attention (CMC) module is the core of the framework and is composed of self-attention blocks and guided-attention blocks. The self-attention block learns intra-modality relations, while the guided-attention block models cross-modal interactions between an image and a question. Cascading multiple CMC modules not only improves the fusion of visual and language representations but also captures more representative image and text information. (3) We carried out a thorough experimental verification of the proposed model. Experimental evaluations on the VQA 2.0 dataset confirm that CMCN has significant performance advantages over existing methods.
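To make the described architecture concrete, below is a minimal PyTorch sketch of one CMC-style module built from a self-attention block and a guided-attention block, using standard multi-head attention as a stand-in. The layer sizes, head count, feature shapes, and the choice of question-guided attention over image features are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of a CMC-style module (assumed configuration, not the
# authors' exact implementation): self-attention within each modality,
# then question-guided attention over the image features.
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Models intra-modality relations within one feature sequence."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Queries, keys, and values all come from the same modality.
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)

class GuidedAttentionBlock(nn.Module):
    """Models cross-modal interactions: one modality attends to the other."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, guide):
        # Queries come from x (e.g. image regions); keys and values come
        # from the guiding modality (e.g. question tokens).
        out, _ = self.attn(x, guide, guide)
        return self.norm(x + out)

class CMCModule(nn.Module):
    """One cross-modality co-attention module."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.img_self = SelfAttentionBlock(dim, heads)
        self.txt_self = SelfAttentionBlock(dim, heads)
        self.img_guided = GuidedAttentionBlock(dim, heads)

    def forward(self, img_feats, txt_feats):
        img_feats = self.img_self(img_feats)        # intra-modality (image)
        txt_feats = self.txt_self(txt_feats)        # intra-modality (text)
        img_feats = self.img_guided(img_feats, txt_feats)  # cross-modal
        return img_feats, txt_feats

if __name__ == "__main__":
    dim, n_modules = 512, 4  # hypothetical sizes
    cascade = nn.ModuleList([CMCModule(dim) for _ in range(n_modules)])
    img = torch.randn(2, 36, dim)   # e.g. 36 region features per image
    txt = torch.randn(2, 14, dim)   # e.g. 14 question tokens
    for module in cascade:          # cascade of CMC modules, as the abstract describes
        img, txt = module(img, txt)
    print(img.shape, txt.shape)

The residual connections and layer normalization follow common transformer practice; the abstract does not specify these details, so they should be read as one plausible instantiation of the cascaded self-attention and guided-attention design.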

