Article

Explanation vs. attention: A two-player game to obtain attention for VQA and visual dialog

Journal

PATTERN RECOGNITION
Volume 132

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2022.108898

Keywords

CNN; LSTM; Explanation; Attention; Grad-CAM; MMD; CORAL; GAN; VQA; Visual Dialog; Deep learning


In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations obtained through class activation mappings (specifically Grad-CAM), which are meant to explain the performance of various networks, could form a means of supervision. However, as the distributions of attention maps and those of Grad-CAMs differ, it would not be suitable to use these directly as a form of supervision. Rather, we propose the use of a discriminator that aims to distinguish samples of visual explanations and attention maps. Adversarial training of the attention regions, as a two-player game between attention and explanation, serves to bring the distributions of attention maps and visual explanations closer. Significantly, we observe that providing such a means of supervision also results in attention maps that are more closely related to human attention, yielding a substantial improvement over baseline stacked attention network (SAN) models. It also results in a good improvement in the rank correlation metric on the VQA task. The method can also be combined with recent MCB-based methods and results in consistent improvement. We also provide comparisons with other means of learning distributions, such as those based on Correlation Alignment (Coral), Maximum Mean Discrepancy (MMD), and Mean Square Error (MSE) losses, and observe that the adversarial loss outperforms the other forms of learning the attention maps. A generalization of the work is also provided by extending our approach to the task of 'Visual Dialog', where the attention is more contextual. Thorough evaluation for this task is also provided. Visualization of the results confirms our hypothesis that attention maps improve using the proposed form of supervision.

(c) 2022 Elsevier Ltd. All rights reserved.
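The abstract compares the proposed adversarial loss against simpler distribution-alignment objectives (MSE, Coral, MMD). As a rough illustration only (not the authors' implementation), these three baseline losses can be sketched in NumPy, treating each row of `a` and `e` as a flattened attention map and Grad-CAM explanation map, respectively:

```python
import numpy as np

def mse_loss(a, e):
    # Pixel-wise mean squared error between attention and explanation maps.
    return ((a - e) ** 2).mean()

def coral_loss(a, e):
    # Correlation Alignment: squared Frobenius distance between the
    # feature covariances of the two sets of maps.
    ca = np.cov(a, rowvar=False)
    ce = np.cov(e, rowvar=False)
    d = a.shape[1]
    return ((ca - ce) ** 2).sum() / (4 * d * d)

def mmd2_loss(a, e, sigma=1.0):
    # Squared Maximum Mean Discrepancy with an RBF kernel:
    # zero when the two sample sets come from the same distribution.
    def k(x, y):
        sq = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))
    return k(a, a).mean() + k(e, e).mean() - 2 * k(a, e).mean()
```

Unlike these fixed objectives, the paper's discriminator learns what separates the two distributions, which is what the reported comparison finds to work best.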


Reviews

Primary Rating

4.7
Not enough ratings
