Article

MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network

Publisher

IEEE Computer Society
DOI: 10.1109/TPAMI.2020.3004830

Keywords

Visual question answering; visual relation; attention mechanism; relation attention

Funding

  1. National Key Research and Development Program of China [2018AAA0102200]
  2. Sichuan Science and Technology Program, China [2018GZDZX0032, 2020YFS0057]
  3. Fundamental Research Funds for the Central Universities [ZYGX2019Z015]
  4. National Natural Science Foundation of China [61632007]
  5. Dongguan Songshan Lake Introduction Program of Leading Innovative and Entrepreneurial Talents

Abstract

Visual Question Answering (VQA) is a task that aims to answer natural language questions about visual images. Existing approaches often use attention mechanisms to focus on relevant visual objects and consider the relationships between objects. However, these approaches have limitations in modeling complex object relationships and leveraging the cooperation between visual appearance and relationships. To address these issues, we propose a novel end-to-end VQA model, called Multi-modal Relation Attention Network (MRA-Net). The model combines textual and visual relations, utilizes self-guided word relation attention, and incorporates question-adaptive visual relation attention modules to improve performance and interpretability. Experimental results on multiple benchmark datasets demonstrate that our proposed model outperforms state-of-the-art approaches.
Visual Question Answering (VQA) is the task of answering natural language questions tied to the content of visual images. Most recent VQA approaches apply attention mechanisms to focus on the relevant visual objects and/or consider the relations between objects via off-the-shelf methods in visual relation reasoning. However, they still suffer from several drawbacks. First, they mostly model only simple relations between objects, so many complicated questions cannot be answered correctly because the model fails to provide sufficient knowledge. Second, they seldom exploit the harmonious cooperation of visual appearance features and relation features. To solve these problems, we propose a novel end-to-end VQA model, termed Multi-modal Relation Attention Network (MRA-Net). The proposed model explores both textual and visual relations to improve performance and interpretability. Specifically, we devise 1) a self-guided word relation attention scheme, which explores the latent semantic relations between words, and 2) two question-adaptive visual relation attention modules that extract not only fine-grained and precise binary relations between objects but also more sophisticated trinary relations. Both kinds of question-related visual relations provide more and deeper visual semantics, thereby improving the visual reasoning ability of question answering. Furthermore, the proposed model combines appearance features with relation features to reconcile the two types of features effectively. Extensive experiments on five large benchmark datasets, VQA-1.0, VQA-2.0, COCO-QA, VQA-CP v2, and TDIUC, demonstrate that our proposed model outperforms state-of-the-art approaches.
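The paper's code is not reproduced here, but the following minimal PyTorch sketch illustrates the general idea of question-adaptive attention over binary (pairwise) object relations described in the abstract. All names (PairwiseRelationAttention, obj_dim, q_dim, hidden_dim) and the exact scoring form are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class PairwiseRelationAttention(nn.Module):
    """Illustrative sketch only: scores every ordered object pair against the
    question and pools a relation feature weighted by those scores."""

    def __init__(self, obj_dim: int, q_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.pair_proj = nn.Linear(2 * obj_dim, hidden_dim)  # fuse an ordered pair of object features
        self.q_proj = nn.Linear(q_dim, hidden_dim)            # project the question representation
        self.score = nn.Linear(hidden_dim, 1)                 # one attention logit per pair

    def forward(self, objs: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # objs: (B, N, obj_dim) region features; q: (B, q_dim) question feature
        B, N, D = objs.shape
        # Build all ordered pairs (i, j): shape (B, N, N, 2*obj_dim)
        pairs = torch.cat(
            [objs.unsqueeze(2).expand(B, N, N, D),
             objs.unsqueeze(1).expand(B, N, N, D)],
            dim=-1,
        )
        pair_h = torch.tanh(self.pair_proj(pairs))                     # (B, N, N, hidden)
        joint = torch.tanh(pair_h + self.q_proj(q)[:, None, None, :])  # question-conditioned pair features
        attn = torch.softmax(self.score(joint).flatten(1), dim=-1)     # (B, N*N) question-adaptive weights
        attn = attn.view(B, N, N, 1)
        # Weighted sum over all pairs gives a single relation feature per image
        return (attn * pair_h).sum(dim=(1, 2))                         # (B, hidden)


if __name__ == "__main__":
    # Toy shapes: 36 region features of size 2048, question feature of size 1024
    module = PairwiseRelationAttention(obj_dim=2048, q_dim=1024)
    rel_feat = module(torch.randn(2, 36, 2048), torch.randn(2, 1024))
    print(rel_feat.shape)  # torch.Size([2, 512])

A trinary variant would score object triples in the same spirit, and the pooled relation feature would then be fused with the attended appearance feature before answer prediction, as the abstract describes.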
