Article

Visual Grounding Via Accumulated Attention

Journal

IEEE Transactions on Pattern Analysis and Machine Intelligence

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPAMI.2020.3023438

Keywords

Proposals; Visualization; Training; Feature extraction; Task analysis; Grounding; Cognition; Visual grounding; accumulated attention; noised training strategy; bounding box regression

Funding

  1. Science and Technology Program of Guangzhou, China [202007030007]
  2. National Natural Science Foundation of China [61876121, 61836003]
  3. Program for Guangdong Introducing Innovative and Entrepreneurial Teams [2017ZT07X183]
  4. Fundamental Research Funds for the Central Universities [D2191240]
  5. Australian Research Council [DE190100539]


Visual grounding aims to locate the most relevant object or region in an image based on natural language queries. This paper proposes an attention module to reduce internal redundancies and an accumulated attention mechanism to capture the relationships among different kinds of information. Additionally, noise is introduced during training to bridge the distribution gap between human-labeled training data and real-world poor-quality data, improving the performance and robustness of the VG models. Experimental results demonstrate the superiority of the proposed methods on various datasets in terms of accuracy.
Visual grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. Generally, it requires the machine to first understand the query, identify the key concepts in the image, and then locate the target object by specifying its bounding box. However, many real-world visual grounding applications involve ambiguous queries and images with complicated scene structures. Identifying the target from highly redundant and correlated information can be very challenging and often leads to unsatisfactory performance. To tackle this, in this paper, we exploit an attention module for each kind of information to reduce internal redundancies. We then propose an accumulated attention (A-ATT) mechanism to reason among all the attention modules jointly. In this way, the relations among different kinds of information can be explicitly captured. Moreover, to improve the performance and robustness of our VG models, we additionally introduce some noise into the training procedure to bridge the distribution gap between the human-labeled training data and real-world poor-quality data. With this noised training strategy, we can further learn a bounding box regressor, which can be used to refine the bounding box of the target object. We evaluate the proposed methods on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!). The experimental results show that our methods significantly outperform all previous works on every dataset in terms of accuracy.
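The accumulated-attention idea described above can be illustrated with a minimal NumPy sketch: three attention modules (query, image, object proposals) each compute soft attention weights, conditioned on the accumulated summaries of the other two, and the process is repeated for several rounds. The function names, feature dimensions, and simple dot-product scoring below are illustrative assumptions, not the paper's actual learned architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(features, context):
    # Score each item by similarity to the accumulated context, then
    # return softmax attention weights and the attended summary vector.
    scores = features @ context
    weights = softmax(scores)
    return weights, weights @ features

def accumulated_attention(query_feats, image_feats, object_feats, rounds=3):
    """Jointly refine attention over query words, image regions, and
    object proposals. Each module attends conditioned on the sum of the
    other modules' attended features (a hypothetical simplification of
    the A-ATT mechanism; all modalities share feature dimension d)."""
    d = query_feats.shape[1]
    summaries = {"query": np.zeros(d), "image": np.zeros(d), "object": np.zeros(d)}
    weights = {}
    for _ in range(rounds):
        for name, feats in (("query", query_feats),
                            ("image", image_feats),
                            ("object", object_feats)):
            # Accumulate the other two modules' summaries as context.
            context = sum(v for k, v in summaries.items() if k != name)
            weights[name], summaries[name] = attend(feats, context)
    # Final attention over object proposals: the grounding distribution.
    return weights["object"]
```

In the first round the context is zero, so each module starts from uniform attention; subsequent rounds sharpen each module's weights using what the other modules have attended to, which is the cross-modal reasoning the abstract refers to.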

