Article

A survey of methods, datasets and evaluation metrics for visual question answering

IMAGE AND VISION COMPUTING (2021)

Journal

IMAGE AND VISION COMPUTING

Volume 116

Publisher

ELSEVIER

DOI: 10.1016/j.imavis.2021.104327

Keywords

Computer vision; Natural language processing; Deep neural networks; World knowledge; Attention


Automated Summary

Visual Question Answering (VQA) is a challenging research problem that combines computer vision and natural language processing. A VQA system may need common sense reasoning over the image content, together with world knowledge, to answer accurately. Beyond traditional models, the survey covers newer VQA models, datasets, and evaluation metrics.

Abstract

Visual Question Answering (VQA) is a multi-disciplinary research problem that has captured the attention of both computer vision and natural language processing researchers. In VQA, a system is given an image and a natural-language question about that image as input, and is required to produce a natural-language answer as output. A VQA algorithm may require common sense reasoning over the information contained in the image, as well as world knowledge, to produce the right answer. In this paper, we discuss some of the core concepts used in VQA systems and present a comprehensive survey of past efforts to address this problem. Apart from traditional VQA models, we also discuss VQA models that require reading text present in images and that are evaluated on recently developed datasets such as TextVQA, ST-VQA, and OCR-VQA. Apart from the standard datasets discussed in previous surveys, we also discuss some new datasets developed in 2019 and 2020, such as GQA, OK-VQA, TextVQA, ST-VQA, and OCR-VQA. New evaluation metrics such as BLEU, MPT, METEOR, Average Normalized Levenshtein Similarity (ANLS), Validity, Plausibility, Distribution, Consistency, Grounding, and F1-Score are explained together with the evaluation metrics discussed in previous surveys. We conclude our survey with a discussion of open issues in each phase of the VQA task and present some promising future directions. (c) 2021 Elsevier B.V. All rights reserved.
