Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval

IEEE TRANSACTIONS ON MULTIMEDIA (2020)

Journal

IEEE TRANSACTIONS ON MULTIMEDIA

Volume 22, Issue 12, Pages 3196-3209

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TMM.2020.2972830

Keywords

Visualization; Cognition; Task analysis; Knowledge discovery; Semantics; Correlation; Information retrieval; Visual relational reasoning; visual attention; visual question answering; cross-modal information retrieval

Categories

Computer Science, Information Systems; Computer Science, Software Engineering; Telecommunications

Funding

National Key Research and Development Program [2017YFB0803301]

Abstract

Cross-modal analysis has become a promising direction for artificial intelligence. Visual representation is crucial for various cross-modal analysis tasks that require visual content understanding. Visual features that contain semantic information can disentangle the underlying correlation between different modalities, thus benefiting downstream tasks. In this paper, we propose a Visual Reasoning and Attention Network (VRANet) as a plug-and-play module that captures rich visual semantics and enhances the visual representation to improve cross-modal analysis. VRANet is built on a bilinear visual attention module that identifies the critical objects. We propose a novel Visual Relational Reasoning (VRR) module to reason about pair-wise and inner-group visual relationships among objects, guided by the textual information. The two modules enhance the visual features at both the relation level and the object level. We demonstrate the effectiveness of the proposed VRANet by applying it to both Visual Question Answering (VQA) and Cross-Modal Information Retrieval (CMIR) tasks. Extensive experiments conducted on the VQA 2.0, CLEVR, CMPlaces, and MS-COCO datasets indicate superior performance compared with state-of-the-art work.
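The abstract names two ingredients: a bilinear attention step that picks out critical objects, and a text-guided reasoning step over object pairs. The sketch below is only an illustration of that general idea, not the VRANet implementation. All function names, dimensions, the low-rank bilinear form, the single ReLU layer, and the attention-weighted sum over pairs are assumptions made for this example; the inner-group relations mentioned in the abstract are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bilinear_attention(V, q, U, W, p):
    """Hypothetical low-rank bilinear attention.

    score_i = p . ((U^T v_i) * (W^T q)); returns the attention weights
    and the attended object-level feature.
    """
    joint = (V @ U) * (q @ W)        # (n, k): object/text interaction
    weights = softmax(joint @ p)     # (n,): one weight per object
    return weights, weights @ V      # (n,), (d_v,)

def pairwise_relations(V, q, Wr, weights):
    """Hypothetical text-guided pairwise reasoning.

    Each ordered pair (v_i, v_j) is concatenated with the text vector q,
    passed through one ReLU layer, and summed with weight a_i * a_j.
    """
    n, d_v = V.shape
    vi = np.broadcast_to(V[:, None, :], (n, n, d_v))
    vj = np.broadcast_to(V[None, :, :], (n, n, d_v))
    qs = np.broadcast_to(q, (n, n, q.shape[0]))
    pairs = np.concatenate([vi, vj, qs], axis=-1)       # (n, n, 2*d_v + d_q)
    rel = np.maximum(pairs @ Wr, 0.0)                   # (n, n, d_r)
    pair_w = np.outer(weights, weights)                 # (n, n)
    return (pair_w[..., None] * rel).sum(axis=(0, 1))   # (d_r,)

# Toy sizes and random parameters, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_v, d_q, k, d_r = 5, 8, 6, 4, 3
V = rng.normal(size=(n, d_v))   # object features
q = rng.normal(size=d_q)        # text (question or caption) feature
U = rng.normal(size=(d_v, k))
W = rng.normal(size=(d_q, k))
p = rng.normal(size=k)
Wr = rng.normal(size=(2 * d_v + d_q, d_r))

weights, obj_feat = bilinear_attention(V, q, U, W, p)
rel_feat = pairwise_relations(V, q, Wr, weights)

# Enhanced visual representation: object level plus relation level.
enhanced = np.concatenate([obj_feat, rel_feat])
```

Concatenating the two outputs is one simple way to combine object-level and relation-level features; the paper may fuse them differently.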

Recommended

Article Computer Science, Artificial Intelligence

Comprehensive-perception dynamic reasoning for visual question answering

Kai Shuang, Jinyu Guo, Zihan Wang

Summary: The goal of Visual Question Answering (VQA) is to answer questions based on an image. Reasoning plays a crucial role in dealing with relations in the VQA task, as it requires modeling complex features. Existing models typically extract and integrate features only between adjacent layers, which may affect the integrity of information interaction. This paper proposes a comprehensive-perception dynamic reasoning (CPDR) model that utilizes cross-layer object features for multi-step compound reasoning, achieving superior performance and bringing considerable improvements when incorporated into VLP models.

PATTERN RECOGNITION (2022)

Article Computer Science, Artificial Intelligence

DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram Question Answering

Yaxian Wang, Bifan Wei, Jun Liu, Lingling Zhang, Jiaxin Wang, Qianying Wang

Summary: This paper proposes a Disentangled Adaptive Visual Reasoning Network (DisAVR) for Diagram Question Answering (DQA), which addresses the challenges of diagram representation and reasoning. DisAVR consists of improved region feature learning, question parsing, and disentangled adaptive reasoning modules. Experimental results demonstrate the superiority of DisAVR.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2023)

Article Geochemistry & Geophysics

A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering

Zixiao Zhang, Licheng Jiao, Lingling Li, Xu Liu, Puhua Chen, Fang Liu, Yuxuan Li, Zhicheng Guo

Summary: In this article, a novel method called spatial hierarchical reasoning network (SHRNet) is proposed to address the limitations of current methods in remote sensing visual question answering (RSVQA). The method enhances the visual-spatial reasoning capability and considers geospatial objects with large-scale differences and position-sensitive properties. Modeling and reasoning about the relationships between entities are also explored for accurate answer predictions.

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING (2023)

Article Computer Science, Information Systems

Depth-Aware and Semantic Guided Relational Attention Network for Visual Question Answering

Yuhang Liu, Wei Wei, Daowan Peng, Xian-Ling Mao, Zhiyong He, Pan Zhou

Summary: The researchers found that previous visual relationship understanding models have problems in accurate reasoning, so they proposed a new model called DSGANet. This model models the relationship between objects in three-dimensional space and explicitly aligns the relationships to address the deficiencies in existing models. The experiments show that DSGANet achieves competitive performance on multiple benchmark datasets.

IEEE TRANSACTIONS ON MULTIMEDIA (2023)

Article Computer Science, Artificial Intelligence

Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Jianjian Cao, Xiameng Qin, Sanyuan Zhao, Jianbing Shen

Summary: This article proposes a graph matching attention (GMA) network to address the challenges of answering semantically complicated questions in visual question answering (VQA) tasks. The network builds graphs for both the image and the question, and utilizes a dual-stage graph encoder and bilateral cross-modality GMA to infer the relationships between them. The updated cross-modality features are then used for final answer prediction. Experimental results show that the network achieves state-of-the-art performance on GQA and VQA 2.0 datasets, and ablation studies verify the effectiveness of each module in the GMA network.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2022)

Article Geochemistry & Geophysics

Change Detection Meets Visual Question Answering

Zhenghang Yuan, Lichao Mou, Zhitong Xiong, Xiao Xiang Zhu

Summary: The detection of changes on the Earth's surface is crucial for urban planning and sustainability. However, current change detection techniques are only accessible to experts. To address this, the study introduces a new task called change detection-based visual question answering (CDVQA) on multitemporal aerial images, enabling users to obtain change-based information easily. The study presents a CDVQA dataset and a baseline framework along with different strategies for improving the performance of the CDVQA task. The results offer valuable insights for future CDVQA research.

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING (2022)

Article Computer Science, Information Systems

Positional Attention Guided Transformer-Like Architecture for Visual Question Answering

Aihua Mao, Zhi Yang, Ken Lin, Jun Xuan, Yong-Jin Liu

Summary: This paper introduces a novel positional attention guided Transformer-like architecture to address the challenge of utilizing positional information in visual question answering (VQA) tasks. Experimental results demonstrate that the proposed model outperforms state-of-the-art models and performs particularly well in handling object counting questions.

IEEE TRANSACTIONS ON MULTIMEDIA (2023)

Article Computer Science, Artificial Intelligence

Adversarial Learning With Multi-Modal Attention for Visual Question Answering

Yun Liu, Xiaoming Zhang, Feiran Huang, Lei Cheng, Zhoujun Li

Summary: A novel visual question answering model, named ALMA, is proposed in this study, utilizing adversarial learning and multi-modal attention to learn a more effective joint representation of question-image pairs, outperforming existing methods.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2021)

Article Computer Science, Artificial Intelligence

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering

Yang Liu, Guanbin Li, Liang Lin

Summary: This study proposes a framework for cross-modal causal relational reasoning to address the limitations of existing visual question answering methods. By introducing causal intervention operations and combining modules such as causality-aware visual-linguistic reasoning, spatial-temporal transformer, and visual-linguistic feature fusion, the framework is able to discover visual-linguistic causal structures and achieve robust event-level visual question answering.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Article Computer Science, Artificial Intelligence

Video Question Answering With Prior Knowledge and Object-Sensitive Learning

Pengpeng Zeng, Haonan Zhang, Lianli Gao, Jingkuan Song, Heng Tao Shen

Summary: This paper addresses the challenges of utilizing prior knowledge and structured visual information in Video Question Answering (VideoQA). The proposed Prior Knowledge and Object-sensitive Learning (PKOL) approach effectively integrates prior knowledge and learns object-sensitive representations to enhance the VideoQA task. The experiments demonstrate consistent improvements and state-of-the-art performance on competitive benchmarks.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2022)

Article Computer Science, Information Systems

Achieving Human Parity on Visual Question Answering

Ming Yan, Haiyang Xu, Chenliang Li, Junfeng Tian, Bin Bi, Wei Wang, Xianzhe Xu, Ji Zhang, Songfang Huang, Fei Huang, Luo Si, Rong Jin

Summary: This paper introduces a novel hierarchical integration of vision and language for the Visual Question Answering (VQA) task, achieving results similar to or even slightly better than human performance. A hierarchical framework is proposed to tackle practical problems in VQA, including diverse visual semantics learning, enhanced multi-modal pre-training, and knowledge-guided model integration. Treating different types of visual questions with corresponding expertise plays an important role in boosting the performance of the VQA architecture.

ACM TRANSACTIONS ON INFORMATION SYSTEMS (2023)

Article Computer Science, Information Systems

Co-attention graph convolutional network for visual question answering

Chuan Liu, Ying-Ying Tan, Tian-Tian Xia, Jiajing Zhang, Ming Zhu

Summary: In this work, a combination of graph convolutional network and co-attention network is proposed to address the limitations of traditional visual attention models in reasoning relationships and multimodal interactions. The model utilizes binary relational reasoning as the graph learner module to capture relationships between visual objects and learns image representation related to specific questions with spatial awareness. Experimental results show that the proposed model achieves an overall accuracy of 68.67% on the test-std set of the benchmark VQA v2.0 dataset, outperforming most existing models.

MULTIMEDIA SYSTEMS (2023)

Article Computer Science, Software Engineering

VisRecall: Quantifying Information Visualisation Recallability via Question Answering

Yao Wang, Chuhan Jiao, Mihai Bace, Andreas Bulling

Summary: This study proposes a question-answering paradigm to investigate the recallability of visualizations and creates a new dataset called VisRecall, which includes visualizations annotated with recallability scores from crowd-sourced human participants. Additionally, the study introduces a computational method for predicting the recallability of different visualization elements and demonstrates its effectiveness on the VisRecall dataset.

IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS (2022)

Article Computer Science, Information Systems

A question-guided multi-hop reasoning graph network for visual question answering

Zhaoyang Xu, Jinguang Gu, Maofu Liu, Guangyou Zhou, Haidong Fu, Chen Qiu

Summary: This paper investigates the potential of reasoning graph network on multi-hop reasoning questions. By constructing a cross-modal interaction module and a multi-hop reasoning graph network, the model dynamically updates the inter-associated instruction between two modalities to infer an answer. The experiments show that graph-based multi-hop reasoning improves visual question answering tasks significantly.

INFORMATION PROCESSING & MANAGEMENT (2023)

Article Engineering, Electrical & Electronic

Action-Centric Relation Transformer Network for Video Question Answering

Jipeng Zhang, Jie Shao, Rui Cao, Lianli Gao, Xing Xu, Heng Tao Shen

Summary: Video question answering (VideoQA) is a popular research topic that has received a lot of attention in recent years. Researchers have focused on fusion strategies and feature preparation, but little attention has been given to incorporating actions of interest and exploring frame-to-frame relations. This study introduces an action-centric relation transformer network (ACRTransformer) that addresses these issues and demonstrates superior performance over previous state-of-the-art models.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Article Engineering, Electrical & Electronic

Language-Guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning

Weixia Zhang, Chao Ma, Qi Wu, Xiaokang Yang

Summary: The main challenges of the emerging vision-and-language navigation (VLN) problem arise from the combination of language instructions and visual environments, as well as the discrepancy in action selection between training and inference.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2021)

Article Computer Science, Information Systems

A Proposal-Free One-Stage Framework for Referring Expression Comprehension and Generation via Dense Cross-Attention

Mengyang Sun, Wei Suo, Peng Wang, Yanning Zhang, Qi Wu

Summary: This paper presents a proposal-free one-stage (PFOS) framework that can directly regress the region-of-interest from the image or generate unambiguous descriptions in an end-to-end manner. By taking the dense-grid of images as input and using a cross-attention transformer, the model learns multi-modal correspondences and eliminates the need for additional annotations or off-the-shelf detectors in the mainstream two-stage methods. Furthermore, the traditional two-stage listener-speaker framework is expanded to be jointly trained by a one-stage learning paradigm, resulting in state-of-the-art performance on comprehension and competitive results for generation.

IEEE TRANSACTIONS ON MULTIMEDIA (2023)

Article Computer Science, Artificial Intelligence

Visual Grounding Via Accumulated Attention

Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, Mingkui Tan

Summary: Visual grounding aims to locate the most relevant object or region in an image based on natural language queries. This paper proposes an attention module to reduce internal redundancies and an accumulated attention mechanism to capture the relationship among different kinds of information. Additionally, noise is introduced to bridge the distribution gap between human-labeled training data and real-world poor quality data, improving the performance and robustness of the VG models. Experimental results demonstrate the superiority of the proposed methods on various datasets in terms of accuracy.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2022)

Article Computer Science, Artificial Intelligence

Rethinking and Improving Feature Pyramids for One-Stage Referring Expression Comprehension

Wei Suo, Mengyang Sun, Peng Wang, Yanning Zhang, Qi Wu

Summary: Referring Expression Comprehension (REC) is a crucial task in the vision-and-language community, and it plays a vital role in various cross-modal tasks. Existing research focuses on a one-stage paradigm, treating REC as a language-conditioned object detection task to achieve a balance between speed and accuracy. However, previous frameworks overlook the importance of integrating multi-level features and often rely on single-scale features for target localization.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2023)

Article Computer Science, Artificial Intelligence

HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation

Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, Qi Wu

Summary: This paper proposes an enhanced and history-aware pre-training method for Vision-and-Language Navigation (VLN), which introduces three novel VLN-specific proxy tasks and a memory network to improve historical knowledge learning and action prediction. The proposed method achieves new state-of-the-art performance on four downstream VLN tasks.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Article Engineering, Electrical & Electronic

Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning

Mengge He, Wenjing Du, Zhiquan Wen, Qing Du, Yutong Xie, Qi Wu

Summary: In this paper, a Multi-Granularity Aggregation Transformer (MGAT) is proposed for joint video-audio-text representation learning. The method overcomes the limitations of existing methods by designing a multi-granularity transformer module and an attention-guided aggregation module. The aggregated information is aligned with text information at different hierarchical levels using consistency loss and contrastive loss. Experimental results demonstrate the superiority of the proposed method on tasks such as video-paragraph retrieval and video captioning.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2023)

Article Computer Science, Information Systems

Attention is Needed for RF Fingerprinting

Hanqing Gu, Lisheng Su, Weifeng Zhang, Chuan Ran

Summary: This paper proposes a novel Dual Attention Convolution module to learn robust RF fingerprints, improving the performance of convolutional neural networks on RF fingerprinting.

IEEE ACCESS (2023)

Article Computer Science, Artificial Intelligence

Weakly-Supervised 3D Spatial Reasoning for Text-Based Visual Question Answering

Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen

Summary: TextVQA aims to produce correct answers for questions about images with multiple scene texts. This paper introduces 3D geometric information into the spatial reasoning process to capture contextual knowledge. Experimental results show that the proposed method achieves state-of-the-art performance on TextVQA and ST-VQA datasets.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2023)

Article Computer Science, Cybernetics

Data Hiding With Deep Learning: A Survey Unifying Digital Watermarking and Steganography

Zihan Wang, Olivia Byrnes, Hu Wang, Ruoxi Sun, Congbo Ma, Huaming Chen, Qi Wu, Minhui Xue

Summary: The use of deep learning techniques in data hiding has greatly advanced secure communication and identity verification fields. Digital watermarking and steganography techniques, by embedding information into noise-tolerant signals like audio, video, or images, can protect sensitive intellectual property (IP) and enable confidential communication for authorized parties. This survey provides a systematic overview of recent developments in deep learning techniques for data hiding, based on model architectures and noise injection methods. It also suggests and discusses potential future research directions that combine digital watermarking and steganography in software engineering to enhance security and mitigate risks. This contribution promotes the creation of a more trustworthy digital world and advances responsible artificial intelligence (AI).

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (2023)

Article Computer Science, Information Systems

Show, Price and Negotiate: A Negotiator With Online Value Look-Ahead

Amin Parvaneh, Ehsan Abbasnejad, Qi Wu, Javen Qinfeng Shi, Anton van den Hengel

Summary: This study proposes a modular deep neural network called Price Negotiator to improve negotiation in online shopping. It addresses the challenges by considering item images, finding similar items, predicting price actions, and adjusting prices based on predicted actions.

IEEE TRANSACTIONS ON MULTIMEDIA (2022)

Article Computer Science, Information Systems

Co-LDL: A Co-Training-Based Label Distribution Learning Method for Tackling Label Noise

Zeren Sun, Huafeng Liu, Qiong Wang, Tianfei Zhou, Qi Wu, Zhenmin Tang

Summary: This paper proposes an end-to-end framework named Co-LDL for addressing the performance degradation of deep neural networks caused by label noise. The framework incorporates the low-loss sample selection strategy with label distribution learning and trains two deep neural networks simultaneously to communicate useful knowledge. Additionally, a self-supervised module is introduced to enhance the learned representations.

IEEE TRANSACTIONS ON MULTIMEDIA (2022)

Article Computer Science, Information Systems

Robust Learning From Noisy Web Images Via Data Purification for Fine-Grained Recognition

Chuanyi Zhang, Qiong Wang, Guosen Xie, Qi Wu, Fumin Shen, Zhenmin Tang

Summary: This article introduces a method for learning fine-grained tasks from web data, which purifies noisy training sets by identifying and distinguishing noisy images, and trains models to alleviate the effects of noise.

IEEE TRANSACTIONS ON MULTIMEDIA (2022)

Article Computer Science, Artificial Intelligence

Multi-Intersection Traffic Optimisation: A Benchmark Dataset and a Strong Baseline

Hu Wang, Hao Chen, Qi Wu, Congbo Ma, Yidong Li

Summary: The control of traffic signals is crucial in relieving traffic congestion in urban areas. However, it is difficult due to the complexity of real-world traffic dynamics. To address this, the researchers propose a new dataset and a novel model based on deep reinforcement learning for optimizing multi-intersection traffic control. The experimental results show that the proposed model outperforms other methods.

IEEE OPEN JOURNAL OF INTELLIGENT TRANSPORTATION SYSTEMS (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu

Summary: This paper discusses the advantages of a simple attention mechanism in OCR text-related tasks, splitting OCR features into visual and linguistic attention branches and sending them to a Transformer decoder to generate answers or captions. The baseline model performs strongly, outperforming state-of-the-art models on two popular benchmarks and surpassing the TextCaps Challenge 2020 winner.

THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

Zhaokai Wang, Renda Bao, Qi Wu, Si Liu

Summary: Reading text in the visual scene is crucial to understanding key information when describing an image. This study introduces a Confidence-aware Non-repetitive Multimodal Transformer (CNMT) to read OCR tokens and generate accurate descriptions.

THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2021)
