Article

Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism

Journal

ELECTRONICS
Volume 11, Issue 11

Publisher

MDPI
DOI: 10.3390/electronics11111778

Keywords

multi-modal alignment; multi-hop attention; visual question answering; feature fusion

Funding

  1. National Natural Science Foundation of China [61771197]

Abstract

The alignment of information between image and question is crucial in visual question answering. This paper proposes a multi-hop attention alignment method and a position embedding mechanism to enrich attention weights and utilize position information, resulting in improved model performance.
The alignment of information between the image and the question is of great significance in the visual question answering (VQA) task. Self-attention is commonly used to generate attention weights between the image and the question, and these weights align the two modalities: through them, the model selects the regions of the image relevant to the question. However, under self-attention, the attention weight between two objects is determined solely by the representations of those two objects, ignoring the influence of the objects around them. This paper proposes a novel multi-hop attention alignment method that enriches the attention weights with surrounding information when self-attention is used to align the two modalities. In addition, to exploit positional information during alignment, we propose a position embedding mechanism that extracts the position of each object and uses it to align each question word with the correct location in the image. On the VQA2.0 dataset, our model achieves a validation accuracy of 65.77%, outperforming several state-of-the-art methods. The experimental results demonstrate the effectiveness of the proposed methods.
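The core idea of the abstract — that a one-hop self-attention weight between two objects ignores the objects around them, while a multi-hop scheme lets weights flow through intermediate objects — can be sketched as follows. This is a minimal illustrative NumPy sketch, not the authors' implementation: the averaging of hop matrices and the function names are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_hop_attention(Q, K, V, hops=2):
    """Illustrative multi-hop attention (not the paper's exact formulation).

    One-hop weights A come from standard scaled dot-product self-attention,
    so A[i, j] depends only on objects i and j. Higher hops (A @ A, ...)
    route weight through intermediate objects, letting surrounding objects
    influence the alignment between i and j.
    """
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # (n, n) one-hop attention weights
    hop = A
    mixed = np.zeros_like(A)
    for _ in range(hops):
        mixed += hop      # accumulate 1-hop, 2-hop, ... weight matrices
        hop = hop @ A     # extend every attention path by one more hop
    mixed /= hops         # average keeps each row a probability distribution
    return mixed @ V, mixed
```

Because each hop matrix is row-stochastic and products of row-stochastic matrices remain row-stochastic, the averaged weights still form a valid attention distribution over the objects.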
