Article

Cross-modality co-attention networks for visual question answering

Journal

SOFT COMPUTING
Volume 25, Issue 7, Pages 5411-5421

Publisher

SPRINGER
DOI: 10.1007/s00500-020-05539-7

Keywords

Visual question answering; Cross-modality co-attention; Computer vision

Funding

  1. National Natural Science Foundation of China [61672338, 61873160]

Abstract

Visual question answering (VQA) is an emerging task that combines natural language processing and computer vision. Selecting compelling multi-modality features is at the core of VQA. In multi-modal learning, attention networks provide an effective way to selectively utilize the given visual information. However, most previous VQA models focus only on the relationship between visual and language features and ignore the internal relationships within each modality. To address this issue: (1) we propose a cross-modality co-attention networks (CMCN) framework that learns both intra-modality and cross-modality relationships. (2) The cross-modality co-attention (CMC) module is the core of the framework and is composed of self-attention blocks and guided-attention blocks. The self-attention block learns intra-modality relations, while the guided-attention block models cross-modal interactions between an image and a question. Cascading multiple CMC modules not only improves the fusion of visual and language representations but also captures more representative image and text information. (3) We carried out a thorough experimental verification of the proposed model. Experimental evaluations on the VQA 2.0 dataset confirm that CMCN has significant performance advantages over existing methods.
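To make the described architecture concrete, below is a minimal PyTorch sketch of one CMC-style module built from a self-attention block and a guided-attention block, using standard multi-head attention as a stand-in. The layer sizes, head count, feature shapes, and the choice of question-guided attention over image features are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of a CMC-style module (assumed configuration, not the
# authors' exact implementation): self-attention within each modality,
# then question-guided attention over the image features.
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Models intra-modality relations within one feature sequence."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Queries, keys, and values all come from the same modality.
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)

class GuidedAttentionBlock(nn.Module):
    """Models cross-modal interactions: one modality attends to the other."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, guide):
        # Queries come from x (e.g. image regions); keys and values come
        # from the guiding modality (e.g. question tokens).
        out, _ = self.attn(x, guide, guide)
        return self.norm(x + out)

class CMCModule(nn.Module):
    """One cross-modality co-attention module."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.img_self = SelfAttentionBlock(dim, heads)
        self.txt_self = SelfAttentionBlock(dim, heads)
        self.img_guided = GuidedAttentionBlock(dim, heads)

    def forward(self, img_feats, txt_feats):
        img_feats = self.img_self(img_feats)        # intra-modality (image)
        txt_feats = self.txt_self(txt_feats)        # intra-modality (text)
        img_feats = self.img_guided(img_feats, txt_feats)  # cross-modal
        return img_feats, txt_feats

if __name__ == "__main__":
    dim, n_modules = 512, 4  # hypothetical sizes
    cascade = nn.ModuleList([CMCModule(dim) for _ in range(n_modules)])
    img = torch.randn(2, 36, dim)   # e.g. 36 region features per image
    txt = torch.randn(2, 14, dim)   # e.g. 14 question tokens
    for module in cascade:          # cascade of CMC modules, as the abstract describes
        img, txt = module(img, txt)
    print(img.shape, txt.shape)

The residual connections and layer normalization follow common transformer practice; the abstract does not specify these details, so they should be read as one plausible instantiation of the cascaded self-attention and guided-attention design.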

