Article

Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 22, Issue 12, Pages 3196-3209

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TMM.2020.2972830

Keywords

Visualization; Cognition; Task analysis; Knowledge discovery; Semantics; Correlation; Information retrieval; Visual relational reasoning; visual attention; visual question answering; cross-modal information retrieval

Funding

  1. National Key Research and Development Program [2017YFB0803301]

Abstract

Cross-modal analysis has become a promising direction for artificial intelligence. Visual representation is crucial for various cross-modal analysis tasks that require visual content understanding. Visual features that contain semantic information can disentangle the underlying correlation between different modalities, thus benefiting downstream tasks. In this paper, we propose a Visual Reasoning and Attention Network (VRANet) as a plug-and-play module that captures rich visual semantics and enhances the visual representation to improve cross-modal analysis. VRANet is built on a bilinear visual attention module, which identifies the critical objects. We propose a novel Visual Relational Reasoning (VRR) module to reason about pair-wise and inner-group visual relationships among objects, guided by the textual information. The two modules enhance the visual features at both the relation level and the object level. We demonstrate the effectiveness of the proposed VRANet by applying it to both Visual Question Answering (VQA) and Cross-Modal Information Retrieval (CMIR) tasks. Extensive experiments conducted on the VQA 2.0, CLEVR, CMPlaces, and MS-COCO datasets indicate superior performance compared with state-of-the-art work.
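
The two-level idea in the abstract (text-guided attention over objects, plus text-guided reasoning over object pairs) can be illustrated with a toy sketch. This is not the authors' VRANet: the function names, the bilinear form `v^T W q`, the element-wise product used as a pair feature, and the final concatenation are all assumptions chosen for illustration, and the inner-group relations mentioned in the abstract are not modeled.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_score(v, W, q):
    # Bilinear match v^T W q between a visual vector v and a text vector q.
    return sum(v[i] * sum(W[i][j] * q[j] for j in range(len(q)))
               for i in range(len(v)))

def object_attention(objects, W, q):
    # Object level (assumed form): weight each object by its match with q.
    weights = softmax([bilinear_score(v, W, q) for v in objects])
    dim = len(objects[0])
    attended = [sum(w * v[d] for w, v in zip(weights, objects))
                for d in range(dim)]
    return weights, attended

def pairwise_relation(objects, W, q):
    # Relation level (assumed form): score every unordered object pair,
    # using the element-wise product as a toy pair feature.
    pairs = [(i, j) for i in range(len(objects))
             for j in range(i + 1, len(objects))]
    feats = [[a * b for a, b in zip(objects[i], objects[j])]
             for i, j in pairs]
    weights = softmax([bilinear_score(f, W, q) for f in feats])
    dim = len(objects[0])
    relation = [sum(w * f[d] for w, f in zip(weights, feats))
                for d in range(dim)]
    return dict(zip(pairs, weights)), relation

# Hypothetical toy inputs: three 2-D object features and one text vector.
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
q = [2.0, 0.0]

obj_weights, obj_feat = object_attention(objects, W, q)
pair_weights, rel_feat = pairwise_relation(objects, W, q)

# Enhanced representation: object-level and relation-level features side by side.
enhanced = obj_feat + rel_feat
```

In this toy setup the text vector emphasizes the first feature dimension, so the objects and the pair that are strong in that dimension receive the largest weights; a learned model would replace the fixed `W` with trained parameters.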

Recommended

Article Engineering, Electrical & Electronic

Language-Guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning

Weixia Zhang, Chao Ma, Qi Wu, Xiaokang Yang

Summary: The main challenges of the emerging vision-and-language navigation (VLN) problem arise from the combination of language instructions and visual environments, as well as the discrepancy in action selection between training and inference.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2021)

Article Computer Science, Information Systems

A Proposal-Free One-Stage Framework for Referring Expression Comprehension and Generation via Dense Cross-Attention

Mengyang Sun, Wei Suo, Peng Wang, Yanning Zhang, Qi Wu

Summary: This paper presents a proposal-free one-stage (PFOS) framework that can directly regress the region-of-interest from the image or generate unambiguous descriptions in an end-to-end manner. By taking the dense-grid of images as input and using a cross-attention transformer, the model learns multi-modal correspondences and eliminates the need for additional annotations or off-the-shelf detectors in the mainstream two-stage methods. Furthermore, the traditional two-stage listener-speaker framework is expanded to be jointly trained by a one-stage learning paradigm, resulting in state-of-the-art performance on comprehension and competitive results for generation.

IEEE TRANSACTIONS ON MULTIMEDIA (2023)

Article Computer Science, Artificial Intelligence

Visual Grounding Via Accumulated Attention

Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, Mingkui Tan

Summary: Visual grounding aims to locate the most relevant object or region in an image based on natural language queries. This paper proposes an attention module to reduce internal redundancies and an accumulated attention mechanism to capture the relationship among different kinds of information. Additionally, noise is introduced to bridge the distribution gap between human-labeled training data and real-world poor quality data, improving the performance and robustness of the VG models. Experimental results demonstrate the superiority of the proposed methods on various datasets in terms of accuracy.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2022)

Article Computer Science, Artificial Intelligence

Rethinking and Improving Feature Pyramids for One-Stage Referring Expression Comprehension

Wei Suo, Mengyang Sun, Peng Wang, Yanning Zhang, Qi Wu

Summary: Referring Expression Comprehension (REC) is a crucial task in the vision-and-language community, and it plays a vital role in various cross-modal tasks. Existing research focuses on a one-stage paradigm, treating REC as a language-conditioned object detection task to achieve a balance between speed and accuracy. However, previous frameworks overlook the importance of integrating multi-level features and often rely on single-scale features for target localization.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2023)

Article Computer Science, Artificial Intelligence

HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation

Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, Qi Wu

Summary: This paper proposes an enhanced and history-aware pre-training method for Vision-and-Language Navigation (VLN), which introduces three novel VLN-specific proxy tasks and a memory network to improve historical knowledge learning and action prediction. The proposed method achieves new state-of-the-art performance on four downstream VLN tasks.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Article Engineering, Electrical & Electronic

Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning

Mengge He, Wenjing Du, Zhiquan Wen, Qing Du, Yutong Xie, Qi Wu

Summary: In this paper, a Multi-Granularity Aggregation Transformer (MGAT) is proposed for joint video-audio-text representation learning. The method overcomes the limitations of existing methods by designing a multi-granularity transformer module and an attention-guided aggregation module. The aggregated information is aligned with text information at different hierarchical levels using consistency loss and contrastive loss. Experimental results demonstrate the superiority of the proposed method on tasks such as video-paragraph retrieval and video captioning.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2023)

Article Computer Science, Information Systems

Attention is Needed for RF Fingerprinting

Hanqing Gu, Lisheng Su, Weifeng Zhang, Chuan Ran

Summary: This paper proposes a novel Dual Attention Convolution module to learn robust RF fingerprints, improving the performance of convolutional neural networks on RF fingerprinting.

IEEE ACCESS (2023)

Article Computer Science, Artificial Intelligence

Weakly-Supervised 3D Spatial Reasoning for Text-Based Visual Question Answering

Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen

Summary: TextVQA aims to produce correct answers for questions about images with multiple scene texts. This paper introduces 3D geometric information into the spatial reasoning process to capture contextual knowledge. Experimental results show that the proposed method achieves state-of-the-art performance on TextVQA and ST-VQA datasets.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2023)

Article Computer Science, Cybernetics

Data Hiding With Deep Learning: A Survey Unifying Digital Watermarking and Steganography

Zihan Wang, Olivia Byrnes, Hu Wang, Ruoxi Sun, Congbo Ma, Huaming Chen, Qi Wu, Minhui Xue

Summary: The use of deep learning techniques in data hiding has greatly advanced secure communication and identity verification fields. Digital watermarking and steganography techniques, by embedding information into noise-tolerant signals like audio, video, or images, can protect sensitive intellectual property (IP) and enable confidential communication for authorized parties. This survey provides a systematic overview of recent developments in deep learning techniques for data hiding, based on model architectures and noise injection methods. It also suggests and discusses potential future research directions that combine digital watermarking and steganography in software engineering to enhance security and mitigate risks. This contribution promotes the creation of a more trustworthy digital world and advances responsible artificial intelligence (AI).

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (2023)

Article Computer Science, Information Systems

Show, Price and Negotiate: A Negotiator With Online Value Look-Ahead

Amin Parvaneh, Ehsan Abbasnejad, Qi Wu, Javen Qinfeng Shi, Anton van den Hengel

Summary: This study proposes a modular deep neural network called Price Negotiator to improve negotiation in online shopping. It addresses the challenges by considering item images, finding similar items, predicting price actions, and adjusting prices based on predicted actions.

IEEE TRANSACTIONS ON MULTIMEDIA (2022)

Article Computer Science, Information Systems

Co-LDL: A Co-Training-Based Label Distribution Learning Method for Tackling Label Noise

Zeren Sun, Huafeng Liu, Qiong Wang, Tianfei Zhou, Qi Wu, Zhenmin Tang

Summary: This paper proposes an end-to-end framework named Co-LDL for addressing the performance degradation of deep neural networks caused by label noise. The framework incorporates the low-loss sample selection strategy with label distribution learning and trains two deep neural networks simultaneously to communicate useful knowledge. Additionally, a self-supervised module is introduced to enhance the learned representations.

IEEE TRANSACTIONS ON MULTIMEDIA (2022)

Article Computer Science, Information Systems

Robust Learning From Noisy Web Images Via Data Purification for Fine-Grained Recognition

Chuanyi Zhang, Qiong Wang, Guosen Xie, Qi Wu, Fumin Shen, Zhenmin Tang

Summary: This article introduces a method for learning fine-grained tasks from web data, which purifies noisy training sets by identifying and distinguishing noisy images, and trains models to alleviate the effects of noise.

IEEE TRANSACTIONS ON MULTIMEDIA (2022)

Article Computer Science, Artificial Intelligence

Multi-Intersection Traffic Optimisation: A Benchmark Dataset and a Strong Baseline

Hu Wang, Hao Chen, Qi Wu, Congbo Ma, Yidong Li

Summary: The control of traffic signals is crucial in relieving traffic congestion in urban areas. However, it is difficult due to the complexity of real-world traffic dynamics. To address this, the researchers propose a new dataset and a novel model based on deep reinforcement learning for optimizing multi-intersection traffic control. The experimental results show that the proposed model outperforms other methods.

IEEE OPEN JOURNAL OF INTELLIGENT TRANSPORTATION SYSTEMS (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu

Summary: This paper discusses the advantages of a simple attention mechanism in OCR text-related tasks, splitting OCR features into visual and linguistic attention branches and sending them to a Transformer decoder to generate answers or captions. The baseline model performs strongly, outperforming state-of-the-art models on two popular benchmarks and surpassing the TextCaps Challenge 2020 winner.

THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

Zhaokai Wang, Renda Bao, Qi Wu, Si Liu

Summary: Reading text in the visual scene is crucial to understanding key information when describing an image. This study introduces a Confidence-aware Non-repetitive Multimodal Transformer (CNMT) to read OCR tokens and generate accurate descriptions.

THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2021)