Article

Research on Visual Question Answering Based on GAT Relational Reasoning

Journal

NEURAL PROCESSING LETTERS
Volume 54, Issue 2, Pages 1435-1448

Publisher

SPRINGER
DOI: 10.1007/s11063-021-10689-2

Keywords

Visual question answering; Relational reasoning; Attention mechanism; Graph attention network; Multi-model feature fusion

Funding

  1. NSFC [62076200]
  2. Shaanxi Natural Science Foundation [2020JM-468]


In this paper, a Graph Attention Network Relational Reasoning (GAT2R) model is proposed to address the challenges posed by the diversity of questions in VQA. The model comprises scene graph generation, which extracts spatial features and predicts relations between objects, and scene graph answer prediction. Experiments show that the model achieves improved accuracy on the GQA and VQA2.0 datasets, validating its effectiveness and generalization.
The diversity of questions in VQA poses new challenges for the construction of VQA models. Existing VQA models focus on devising new attention mechanisms, which makes them increasingly complex. In addition, most concentrate on object recognition while neglecting spatial reasoning, semantic relations, and even scene understanding. Therefore, this paper proposes a Graph Attention Network Relational Reasoning (GAT2R) model, which mainly comprises scene graph generation and scene graph answer prediction. The scene graph generation module extracts the regional and spatial features of objects with an object detection model, and uses a relation decoder to predict the relations between object pairs. The scene graph answer prediction module dynamically updates node representations through a question-guided graph attention network, then performs multi-modal fusion with the question features to obtain the answer. The proposed model achieves an accuracy of 54.45% on the natural-scene dataset GQA, which is largely based on relational reasoning, and 68.04% on the widely used VQA2.0 dataset. Compared with the benchmark model, accuracy improves by 4.71% on GQA and 2.37% on VQA2.0, demonstrating the model's effectiveness and generalization.
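The question-guided graph attention step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's actual architecture: the projection weights, the LeakyReLU scoring, the mean-pooling, and the element-wise-product fusion are all assumptions chosen for brevity, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def question_guided_gat_layer(H, A, q, Wn, Wq, a):
    """One question-guided graph-attention layer (illustrative sketch).

    H  : (N, d) node features from the scene graph
    A  : (N, N) adjacency mask (nonzero where a relation exists)
    q  : (d,)   question embedding
    Wn, Wq : (d, d) projection weights; a : (2*d,) attention vector
    """
    guided = H @ Wn + q @ Wq          # inject the question signal into every node
    N = H.shape[0]
    logits = np.full((N, N), -1e9)    # mask non-edges before softmax
    for i in range(N):
        for j in range(N):
            if A[i, j]:
                # GAT-style pairwise score a^T [h_i || h_j]
                logits[i, j] = leaky_relu(
                    np.concatenate([guided[i], guided[j]]) @ a)
    alpha = softmax(logits, axis=1)   # attention over each node's neighbours
    return alpha @ guided             # updated node representations

def fuse(node_repr, q):
    """Multi-modal fusion by element-wise product (one common choice)."""
    g = node_repr.mean(axis=0)        # pool the graph to a single vector
    return g * q                      # fused feature fed to the answer classifier
```

In practice a model like this would stack several such layers and learn the weights end to end; the sketch only shows how a question embedding can bias the graph attention before fusion.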

