Mutual Attention Inception Network for Remote Sensing Visual Question Answering

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING (2022)

Journal

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING

Volume 60

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TGRS.2021.3079918

Keywords

Task analysis; Remote sensing; Visualization; Knowledge discovery; Semantics; Object detection; Feature extraction; Attention mechanism; feature fusion; remote sensing visual question answering (RSVQA); semantic understanding

Funding

National Science Fund for Distinguished Young Scholars [61925112]
National Natural Science Foundation of China [61806193, 61772510]
Innovation Capability Support Program of Shaanxi [2020KJXX-091, 2020TD-015]
Key Research and Development Program of Shaanxi [2020ZDLGY04-03]

Automated Summary

This study introduces a method for remote sensing visual question answering (VQA) that fuses image features and question features using convolutional features, word vectors, an attention mechanism, and a bilinear technique. Experimental results demonstrate that the proposed method can capture the alignments between images and questions.

Abstract

Remote sensing images (RSIs) containing various ground objects have been applied in many fields. To make semantic understanding of RSIs objective and interactive, the task of remote sensing visual question answering (VQA) has emerged. Given an RSI, the goal of remote sensing VQA is to make an intelligent agent answer a question about the remote sensing scene. Existing remote sensing VQA methods use a nonspatial fusion strategy to fuse the image features and question features, which ignores the spatial information of images and the word-level information of questions. A novel method is proposed to complete the task while considering these two aspects. First, convolutional features of the image are included to represent spatial information, and the word vectors of questions are adopted to represent semantic word information. Second, an attention mechanism and a bilinear technique are introduced to enhance the features by considering the alignments between spatial positions and words. Finally, a fully connected layer with softmax is used to output an answer, treating the problem as a multiclass classification task. To benchmark this task, an RSIVQA dataset is introduced in this article. For each of more than 37,000 RSIs, the proposed dataset contains at least one question, plus corresponding answers. Experimental results demonstrate that the proposed method can capture the alignments between images and questions. The code and dataset are available at https://github.com/spectralpublic/RSIVQA.
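The three steps in the abstract (spatial and word-level features, attention-based alignment with bilinear fusion, softmax classification) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the repository linked above. The function names, the max-over-affinity attention scores, and the element-wise product used as a simple diagonal bilinear fusion are all assumptions chosen for illustration, and the features and weights are random toy values.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its attention weight."""
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def mutual_attention(image_feats, word_vecs):
    """Attend over spatial positions and words via their pairwise affinities.

    image_feats: N position vectors (a flattened convolutional feature map).
    word_vecs:   T word vectors with the same dimension (an assumption here).
    """
    # affinity[i][t]: similarity between spatial position i and word t.
    affinity = [[dot(v, q) for q in word_vecs] for v in image_feats]
    # Assumed scoring rule: a position is scored by its best-matching word,
    # and a word by its best-matching position.
    pos_scores = [max(row) for row in affinity]
    word_scores = [max(row[t] for row in affinity) for t in range(len(word_vecs))]
    pos_attn = softmax(pos_scores)
    word_attn = softmax(word_scores)
    image_vec = weighted_sum(pos_attn, image_feats)
    question_vec = weighted_sum(word_attn, word_vecs)
    return image_vec, question_vec, pos_attn, word_attn


def classify(image_vec, question_vec, weights, biases):
    """Fuse with an element-wise product (a diagonal bilinear form), then apply
    a fully connected layer with softmax over the answer classes."""
    fused = [a * b for a, b in zip(image_vec, question_vec)]
    logits = [dot(w, fused) + b for w, b in zip(weights, biases)]
    return softmax(logits)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    dim, n_positions, n_words, n_answers = 4, 6, 3, 5
    image_feats = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_positions)]
    word_vecs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_words)]
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_answers)]
    biases = [0.0] * n_answers

    image_vec, question_vec, pos_attn, word_attn = mutual_attention(image_feats, word_vecs)
    probs = classify(image_vec, question_vec, weights, biases)
    print("attention over positions:", [round(a, 3) for a in pos_attn])
    print("attention over words:", [round(a, 3) for a in word_attn])
    print("answer probabilities:", [round(p, 3) for p in probs])
```

In this sketch the attention weights over positions and over words each sum to 1, so the attended image and question vectors are convex combinations of the input features. A learned model would replace the random weights with trained parameters and would typically use a richer bilinear fusion than the element-wise product shown here.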

A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering

Zixiao Zhang, Licheng Jiao, Lingling Li, Xu Liu, Puhua Chen, Fang Liu, Yuxuan Li, Zhicheng Guo

Summary: In this article, a novel method called spatial hierarchical reasoning network (SHRNet) is proposed to address the limitations of current methods in remote sensing visual question answering (RSVQA). The method enhances the visual-spatial reasoning capability and considers geospatial objects with large-scale differences and positional sensitive properties. Modeling and reasoning the relationships between entities are also explored for accurate answer predictions.

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING (2023)