Article

Multimodal feature-wise co-attention method for visual question answering

Journal

INFORMATION FUSION
Volume 73, Pages 1-10

Publisher

ELSEVIER
DOI: 10.1016/j.inffus.2021.02.022

Keywords

Feature-wise attention learning; Deep learning; Multimodal feature fusion; Visual question answering (VQA)

Funding

  1. National Natural Science Foundation of China [61672246, 61272068, 61672254]
  2. Program for HUST Academic Frontier Youth Team, China
  3. NVIDIA Corporation (United States)

Abstract

This paper introduces a novel neural network module named MulFA for feature-wise attention modeling, which shows promising experimental results in VQA. By introducing MulFA modules, an effective union feature-wise and spatial co-attention network (UFSCAN) is constructed for VQA, achieving performance competitive with state-of-the-art models on VQA datasets.
VQA has attracted many researchers in recent years, and it could potentially be applied to remote consultation for COVID-19. Attention mechanisms provide an effective way of selectively utilizing visual and question information in visual question answering (VQA). The attention methods of existing VQA models generally focus on the spatial dimension; in other words, attention is modeled as spatial probabilities that re-weight the image region or word token features. However, feature-wise attention cannot be ignored, as image and question representations are organized in both spatial and feature-wise modes. Taking the question "What is the color of the woman's hair?" as an example, identifying the hair-color attribute feature is as important as focusing on the hair region. In this paper, we propose a novel neural network module named multimodal feature-wise attention module (MulFA) to model feature-wise attention. Extensive experiments show that MulFA is capable of filtering representations for feature refinement and leads to improved performance. By introducing MulFA modules, we construct an effective union feature-wise and spatial co-attention network (UFSCAN) model for VQA. Our evaluation on two large-scale VQA datasets, VQA 1.0 and VQA 2.0, shows that UFSCAN achieves performance competitive with state-of-the-art models.
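To make the feature-wise attention idea concrete, the following is a minimal PyTorch sketch of a feature-wise (channel-wise) co-attention block in the spirit of MulFA. It is not the authors' implementation: the class name FeatureWiseCoAttention, the gating layers, and all dimensions are illustrative assumptions. Each modality predicts a sigmoid gate over the other modality's feature channels, so attention re-scales feature dimensions rather than spatial regions or word positions.

# Hypothetical sketch of feature-wise (channel-wise) co-attention, loosely
# inspired by the MulFA idea described in the abstract. All module and
# variable names are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class FeatureWiseCoAttention(nn.Module):
    """Re-weights the feature (channel) dimension of each modality using a
    gating vector predicted from the other modality."""

    def __init__(self, img_dim: int, ques_dim: int, hidden_dim: int = 512):
        super().__init__()
        # Predict per-channel gates for one modality from the other modality.
        self.img_gate = nn.Sequential(
            nn.Linear(ques_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, img_dim), nn.Sigmoid(),
        )
        self.ques_gate = nn.Sequential(
            nn.Linear(img_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, ques_dim), nn.Sigmoid(),
        )

    def forward(self, img_feats: torch.Tensor, ques_feat: torch.Tensor):
        # img_feats: (batch, num_regions, img_dim) region features
        # ques_feat: (batch, ques_dim) pooled question feature
        img_summary = img_feats.mean(dim=1)        # (batch, img_dim)
        img_gates = self.img_gate(ques_feat)       # (batch, img_dim)
        ques_gates = self.ques_gate(img_summary)   # (batch, ques_dim)
        # Feature-wise attention: scale channels rather than spatial positions.
        attended_img = img_feats * img_gates.unsqueeze(1)
        attended_ques = ques_feat * ques_gates
        return attended_img, attended_ques


if __name__ == "__main__":
    module = FeatureWiseCoAttention(img_dim=2048, ques_dim=1024)
    img = torch.randn(2, 36, 2048)   # e.g. 36 bottom-up region features
    ques = torch.randn(2, 1024)      # pooled question embedding
    out_img, out_ques = module(img, ques)
    print(out_img.shape, out_ques.shape)  # (2, 36, 2048) and (2, 1024)

In a full UFSCAN-style model, such a feature-wise stage would be combined with a conventional spatial co-attention stage; the sigmoid gates here contrast with the softmax over regions or words typically used for spatial attention.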
