Article

MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network

Publisher

IEEE Computer Society
DOI: 10.1109/TPAMI.2020.3004830

Keywords

Visual question answering; visual relation; attention mechanism; relation attention

Funding

  1. National Key Research and Development Program of China [2018AAA0102200]
  2. Sichuan Science and Technology Program, China [2018GZDZX0032, 2020YFS0057]
  3. Fundamental Research Funds for the Central Universities [ZYGX2019Z015]
  4. National Natural Science Foundation of China [61632007]
  5. Dongguan Songshan Lake Introduction Program of Leading Innovative and Entrepreneurial Talents

Abstract

Visual Question Answering (VQA) is a task that aims to answer natural language questions about visual images. Existing approaches often use attention mechanisms to focus on relevant visual objects and consider the relationships between objects. However, these approaches have limitations in modeling complex object relationships and leveraging the cooperation between visual appearance and relationships. To address these issues, we propose a novel end-to-end VQA model, called Multi-modal Relation Attention Network (MRA-Net). The model combines textual and visual relations, utilizes self-guided word relation attention, and incorporates question-adaptive visual relation attention modules to improve performance and interpretability. Experimental results on multiple benchmark datasets demonstrate that our proposed model outperforms state-of-the-art approaches.
Visual Question Answering (VQA) is the task of answering natural language questions tied to the content of visual images. Most recent VQA approaches apply attention mechanisms to focus on the relevant visual objects and/or consider the relations between objects via off-the-shelf methods in visual relation reasoning. However, they still suffer from several drawbacks. First, they mostly model only simple relations between objects, so many complicated questions cannot be answered correctly because the model fails to provide sufficient knowledge. Second, they seldom exploit the harmonious cooperation of visual appearance features and relation features. To solve these problems, we propose a novel end-to-end VQA model, termed Multi-modal Relation Attention Network (MRA-Net). The proposed model explores both textual and visual relations to improve performance and interpretability. Specifically, we devise 1) a self-guided word relation attention scheme, which explores the latent semantic relations between words, and 2) two question-adaptive visual relation attention modules that extract not only fine-grained and precise binary relations between objects but also more sophisticated trinary relations. Both kinds of question-related visual relations provide more and deeper visual semantics, thereby improving the visual reasoning ability of question answering. Furthermore, the proposed model combines appearance features with relation features to reconcile the two types of features effectively. Extensive experiments on five large benchmark datasets, VQA-1.0, VQA-2.0, COCO-QA, VQA-CP v2, and TDIUC, demonstrate that our proposed model outperforms state-of-the-art approaches.
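The paper's code is not reproduced here, but the following minimal PyTorch sketch illustrates the general idea of question-adaptive attention over binary (pairwise) object relations described in the abstract. All names (PairwiseRelationAttention, obj_dim, q_dim, hidden_dim) and the exact scoring form are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class PairwiseRelationAttention(nn.Module):
    """Illustrative sketch only: scores every ordered object pair against the
    question and pools a relation feature weighted by those scores."""

    def __init__(self, obj_dim: int, q_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.pair_proj = nn.Linear(2 * obj_dim, hidden_dim)  # fuse an ordered pair of object features
        self.q_proj = nn.Linear(q_dim, hidden_dim)            # project the question representation
        self.score = nn.Linear(hidden_dim, 1)                 # one attention logit per pair

    def forward(self, objs: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # objs: (B, N, obj_dim) region features; q: (B, q_dim) question feature
        B, N, D = objs.shape
        # Build all ordered pairs (i, j): shape (B, N, N, 2*obj_dim)
        pairs = torch.cat(
            [objs.unsqueeze(2).expand(B, N, N, D),
             objs.unsqueeze(1).expand(B, N, N, D)],
            dim=-1,
        )
        pair_h = torch.tanh(self.pair_proj(pairs))                     # (B, N, N, hidden)
        joint = torch.tanh(pair_h + self.q_proj(q)[:, None, None, :])  # question-conditioned pair features
        attn = torch.softmax(self.score(joint).flatten(1), dim=-1)     # (B, N*N) question-adaptive weights
        attn = attn.view(B, N, N, 1)
        # Weighted sum over all pairs gives a single relation feature per image
        return (attn * pair_h).sum(dim=(1, 2))                         # (B, hidden)


if __name__ == "__main__":
    # Toy shapes: 36 region features of size 2048, question feature of size 1024
    module = PairwiseRelationAttention(obj_dim=2048, q_dim=1024)
    rel_feat = module(torch.randn(2, 36, 2048), torch.randn(2, 1024))
    print(rel_feat.shape)  # torch.Size([2, 512])

A trinary variant would score object triples in the same spirit, and the pooled relation feature would then be fused with the attended appearance feature before answer prediction, as the abstract describes.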
