Article

A Bi-level representation learning model for medical visual question answering

Journal

JOURNAL OF BIOMEDICAL INFORMATICS
Volume 134, Article 104183

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE
DOI: 10.1016/j.jbi.2022.104183

Keywords

Medical visual question answering; Token-level reasoning; Sentence-level reasoning; Label-distribution-smooth margin loss

Funding

  1. National Natural Science Foundation of China [61871141]
  2. Natural Science Foundation of Guangdong Province [2021A1515011339]
  3. Research and Development Projects in Key Areas of Guangdong Province [2021A1111120008]
  4. Collaborative Innovation Team of Guangzhou University of Traditional Chinese Medicine [2021XK08]

Abstract

Medical Visual Question Answering (VQA) aims to answer questions about given medical images and holds substantial potential for healthcare services. However, research on medical VQA still faces challenges, particularly in learning fine-grained multimodal semantic representations for answer prediction from relatively small volumes of data. Moreover, the long-tailed label distribution of medical VQA data frequently degrades model performance. To this end, we propose a novel bi-level representation learning model with two reasoning modules that learn bi-level representations for the medical VQA task. One is sentence-level reasoning, which learns sentence-level semantic representations from the multimodal input. The other is token-level reasoning, which employs an attention mechanism to generate a multimodal contextual vector by fusing image features and word embeddings. The contextual vector is used to filter out irrelevant semantic representations from sentence-level reasoning, yielding a fine-grained multimodal representation. Furthermore, a label-distribution-smooth margin loss is proposed to minimize the generalization error bound on long-tailed datasets by modifying the margin bound of different labels in the training set. On the standard VQA-Rad and PathVQA datasets, the proposed model achieves accuracies of 0.7605 and 0.5434 and F1-scores of 0.7741 and 0.5288, respectively, outperforming a set of state-of-the-art baseline models.
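The abstract describes the label-distribution-smooth margin loss only at a high level; its exact formulation is not given here. Below is a minimal PyTorch sketch of a label-distribution-aware margin loss in that spirit, where answers that are rare in the long-tailed training set receive a larger margin before softmax cross-entropy. The class name, margin schedule (n_c^{-1/4}), and hyperparameters (`max_margin`, `scale`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a label-distribution-aware margin loss, assumed to be
# in the spirit of the paper's label-distribution-smooth margin loss: answers
# that occur rarely in the training set are assigned a larger margin, which
# tightens the generalization bound on the tail of the label distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LabelDistributionMarginLoss(nn.Module):
    """Cross-entropy with per-class margins derived from label frequencies."""

    def __init__(self, class_counts, max_margin=0.5, scale=30.0):
        super().__init__()
        counts = torch.as_tensor(class_counts, dtype=torch.float)
        # Margin proportional to n_c^{-1/4}, rescaled so the largest margin
        # equals `max_margin` (an assumed schedule, not taken from the paper).
        margins = 1.0 / counts.pow(0.25)
        margins = margins * (max_margin / margins.max())
        self.register_buffer("margins", margins)
        self.scale = scale

    def forward(self, logits, target):
        # Subtract the class-specific margin from the logit of the true label
        # before computing softmax cross-entropy over the answer vocabulary.
        margin = self.margins[target]                      # shape: (batch,)
        adjusted = logits.clone()
        adjusted[torch.arange(logits.size(0)), target] -= margin
        return F.cross_entropy(self.scale * adjusted, target)


# Usage sketch: class_counts comes from the training-set answer distribution.
# criterion = LabelDistributionMarginLoss(class_counts=[500, 120, 8, 3])
# loss = criterion(model_logits, answer_ids)
```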

