Article
Computer Science, Artificial Intelligence
Kai Shuang, Jinyu Guo, Zihan Wang
Summary: The goal of Visual Question Answering (VQA) is to answer questions based on an image. Reasoning plays a crucial role in dealing with relations in the VQA task, as it requires modeling complex features. Existing models typically extract and integrate features only between adjacent layers, which may affect the integrity of information interaction. This paper proposes a comprehensive-perception dynamic reasoning (CPDR) model that utilizes cross-layer object features for multi-step compound reasoning, achieving superior performance and bringing considerable improvements when incorporated into VLP models.
PATTERN RECOGNITION
(2022)
Article
Computer Science, Artificial Intelligence
Yaxian Wang, Bifan Wei, Jun Liu, Lingling Zhang, Jiaxin Wang, Qianying Wang
Summary: This paper proposes a Disentangled Adaptive Visual Reasoning Network (DisAVR) for Diagram Question Answering (DQA), which addresses the challenges of diagram representation and reasoning. DisAVR consists of improved region feature learning, question parsing, and disentangled adaptive reasoning modules. Experimental results demonstrate the superiority of DisAVR.
IEEE TRANSACTIONS ON IMAGE PROCESSING
(2023)
Article
Geochemistry & Geophysics
Zixiao Zhang, Licheng Jiao, Lingling Li, Xu Liu, Puhua Chen, Fang Liu, Yuxuan Li, Zhicheng Guo
Summary: This article proposes a spatial hierarchical reasoning network (SHRNet) to address the limitations of current methods in remote sensing visual question answering (RSVQA). The method enhances visual-spatial reasoning capability and accounts for geospatial objects with large scale differences and position-sensitive properties. It also explores modeling and reasoning about the relationships between entities for accurate answer prediction.
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING
(2023)
Article
Computer Science, Information Systems
Yuhang Liu, Wei Wei, Daowan Peng, Xian-Ling Mao, Zhiyong He, Pan Zhou
Summary: The authors observe that previous visual relationship understanding models struggle with accurate reasoning and propose a new model, DSGANet. It models the relationships between objects in three-dimensional space and explicitly aligns those relationships to address the deficiencies of existing models. Experiments show that DSGANet achieves competitive performance on multiple benchmark datasets.
IEEE TRANSACTIONS ON MULTIMEDIA
(2023)
Article
Computer Science, Artificial Intelligence
Jianjian Cao, Xiameng Qin, Sanyuan Zhao, Jianbing Shen
Summary: This article proposes a graph matching attention (GMA) network to address the challenges of answering semantically complicated questions in visual question answering (VQA) tasks. The network builds graphs for both the image and the question, and utilizes a dual-stage graph encoder and bilateral cross-modality GMA to infer the relationships between them. The updated cross-modality features are then used for final answer prediction. Experimental results show that the network achieves state-of-the-art performance on GQA and VQA 2.0 datasets, and ablation studies verify the effectiveness of each module in the GMA network.
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
(2022)
Article
Geochemistry & Geophysics
Zhenghang Yuan, Lichao Mou, Zhitong Xiong, Xiao Xiang Zhu
Summary: The detection of changes on the Earth's surface is crucial for urban planning and sustainability. However, current change detection techniques are only accessible to experts. To address this, the study introduces a new task called change detection-based visual question answering (CDVQA) on multitemporal aerial images, enabling users to obtain change-based information easily. The study presents a CDVQA dataset and a baseline framework along with different strategies for improving the performance of the CDVQA task. The results offer valuable insights for future CDVQA research.
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING
(2022)
Article
Computer Science, Information Systems
Aihua Mao, Zhi Yang, Ken Lin, Jun Xuan, Yong-Jin Liu
Summary: This paper introduces a novel positional attention guided Transformer-like architecture to address the challenge of utilizing positional information in visual question answering (VQA) tasks. Experimental results demonstrate that the proposed model outperforms state-of-the-art models and performs particularly well in handling object counting questions.
IEEE TRANSACTIONS ON MULTIMEDIA
(2023)
Article
Computer Science, Artificial Intelligence
Yun Liu, Xiaoming Zhang, Feiran Huang, Lei Cheng, Zhoujun Li
Summary: A novel visual question answering model, named ALMA, is proposed in this study, utilizing adversarial learning and multi-modal attention to learn a more effective joint representation of question-image pairs, outperforming existing methods.
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
(2021)
Article
Computer Science, Artificial Intelligence
Yang Liu, Guanbin Li, Liang Lin
Summary: This study proposes a framework for cross-modal causal relational reasoning to address the limitations of existing visual question answering methods. By introducing causal intervention operations and combining modules such as causality-aware visual-linguistic reasoning, spatial-temporal transformer, and visual-linguistic feature fusion, the framework is able to discover visual-linguistic causal structures and achieve robust event-level visual question answering.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
(2023)
Article
Computer Science, Artificial Intelligence
Pengpeng Zeng, Haonan Zhang, Lianli Gao, Jingkuan Song, Heng Tao Shen
Summary: This paper addresses the challenges of utilizing prior knowledge and structured visual information in Video Question Answering (VideoQA). The proposed Prior Knowledge and Object-sensitive Learning (PKOL) approach effectively integrates prior knowledge and learns object-sensitive representations to enhance the VideoQA task. The experiments demonstrate consistent improvements and state-of-the-art performance on competitive benchmarks.
IEEE TRANSACTIONS ON IMAGE PROCESSING
(2022)
Article
Computer Science, Information Systems
Ming Yan, Haiyang Xu, Chenliang Li, Junfeng Tian, Bin Bi, Wei Wang, Xianzhe Xu, Ji Zhang, Songfang Huang, Fei Huang, Luo Si, Rong Jin
Summary: This paper introduces a novel hierarchical integration of vision and language for the Visual Question Answering (VQA) task, achieving results comparable to or even slightly better than human performance. A hierarchical framework is proposed to tackle practical problems in VQA, including diverse visual semantics learning, enhanced multi-modal pre-training, and knowledge-guided model integration. Treating different types of visual questions with corresponding expertise plays an important role in boosting the performance of the VQA architecture.
ACM TRANSACTIONS ON INFORMATION SYSTEMS
(2023)
Article
Computer Science, Information Systems
Chuan Liu, Ying-Ying Tan, Tian-Tian Xia, Jiajing Zhang, Ming Zhu
Summary: This work combines a graph convolutional network with a co-attention network to address the limitations of traditional visual attention models in relational reasoning and multimodal interaction. The model uses binary relational reasoning as its graph learner module to capture relationships between visual objects and learns spatially aware, question-specific image representations. Experimental results show that the proposed model achieves an overall accuracy of 68.67% on the test-std set of the benchmark VQA v2.0 dataset, outperforming most existing models.
MULTIMEDIA SYSTEMS
(2023)
Article
Computer Science, Software Engineering
Yao Wang, Chuhan Jiao, Mihai Bace, Andreas Bulling
Summary: This study proposes a question-answering paradigm to investigate the recallability of visualizations and creates a new dataset called VisRecall, which includes visualizations annotated with recallability scores from crowd-sourced human participants. Additionally, the study introduces a computational method for predicting the recallability of different visualization elements and demonstrates its effectiveness on the VisRecall dataset.
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS
(2022)
Article
Computer Science, Information Systems
Zhaoyang Xu, Jinguang Gu, Maofu Liu, Guangyou Zhou, Haidong Fu, Chen Qiu
Summary: This paper investigates the potential of a reasoning graph network for multi-hop reasoning questions. By constructing a cross-modal interaction module and a multi-hop reasoning graph network, the model dynamically updates the inter-associated instruction between two modalities to infer an answer. The experiments show that graph-based multi-hop reasoning significantly improves performance on visual question answering tasks.
INFORMATION PROCESSING & MANAGEMENT
(2023)
Article
Engineering, Electrical & Electronic
Jipeng Zhang, Jie Shao, Rui Cao, Lianli Gao, Xing Xu, Heng Tao Shen
Summary: Video question answering (VideoQA) is a popular research topic that has received a lot of attention in recent years. Researchers have focused on fusion strategies and feature preparation, but little attention has been given to incorporating actions of interest and exploring frame-to-frame relations. This study introduces an action-centric relation transformer network (ACRTransformer) that addresses these issues and demonstrates superior performance over previous state-of-the-art models.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
(2022)
Article
Engineering, Electrical & Electronic
Weixia Zhang, Chao Ma, Qi Wu, Xiaokang Yang
Summary: The main challenges of the emerging vision-and-language navigation (VLN) problem arise from the combination of language instructions and visual environments, as well as the discrepancy in action selection between training and inference.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
(2021)
Article
Computer Science, Information Systems
Mengyang Sun, Wei Suo, Peng Wang, Yanning Zhang, Qi Wu
Summary: This paper presents a proposal-free one-stage (PFOS) framework that can directly regress the region of interest from the image or generate unambiguous descriptions in an end-to-end manner. By taking the dense grid of images as input and using a cross-attention transformer, the model learns multi-modal correspondences and eliminates the need for the additional annotations or off-the-shelf detectors used in mainstream two-stage methods. Furthermore, the traditional two-stage listener-speaker framework is extended so that it can be trained jointly under a one-stage learning paradigm, resulting in state-of-the-art performance on comprehension and competitive results for generation.
IEEE TRANSACTIONS ON MULTIMEDIA
(2023)
Article
Computer Science, Artificial Intelligence
Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, Mingkui Tan
Summary: Visual grounding aims to locate the most relevant object or region in an image based on natural language queries. This paper proposes an attention module to reduce internal redundancies and an accumulated attention mechanism to capture the relationship among different kinds of information. Additionally, noise is introduced to bridge the distribution gap between human-labeled training data and real-world poor quality data, improving the performance and robustness of the VG models. Experimental results demonstrate the superiority of the proposed methods on various datasets in terms of accuracy.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
(2022)
Article
Computer Science, Artificial Intelligence
Wei Suo, Mengyang Sun, Peng Wang, Yanning Zhang, Qi Wu
Summary: Referring Expression Comprehension (REC) is a crucial task in the vision-and-language community, and it plays a vital role in various cross-modal tasks. Existing research focuses on a one-stage paradigm, treating REC as a language-conditioned object detection task to achieve a balance between speed and accuracy. However, previous frameworks overlook the importance of integrating multi-level features and often rely on single-scale features for target localization.
IEEE TRANSACTIONS ON IMAGE PROCESSING
(2023)
Article
Computer Science, Artificial Intelligence
Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, Qi Wu
Summary: This paper proposes an enhanced and history-aware pre-training method for Vision-and-Language Navigation (VLN), which introduces three novel VLN-specific proxy tasks and a memory network to improve historical knowledge learning and action prediction. The proposed method achieves new state-of-the-art performance on four downstream VLN tasks.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
(2023)
Article
Engineering, Electrical & Electronic
Mengge He, Wenjing Du, Zhiquan Wen, Qing Du, Yutong Xie, Qi Wu
Summary: In this paper, a Multi-Granularity Aggregation Transformer (MGAT) is proposed for joint video-audio-text representation learning. The method overcomes the limitations of existing methods by designing a multi-granularity transformer module and an attention-guided aggregation module. The aggregated information is aligned with text information at different hierarchical levels using consistency loss and contrastive loss. Experimental results demonstrate the superiority of the proposed method on tasks such as video-paragraph retrieval and video captioning.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
(2023)
Article
Computer Science, Information Systems
Hanqing Gu, Lisheng Su, Weifeng Zhang, Chuan Ran
Summary: This paper proposes a novel Dual Attention Convolution module to learn robust RF fingerprints, improving the performance of convolutional neural networks on RF fingerprinting.
Article
Computer Science, Artificial Intelligence
Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen
Summary: TextVQA aims to produce correct answers for questions about images with multiple scene texts. This paper introduces 3D geometric information into the spatial reasoning process to capture contextual knowledge. Experimental results show that the proposed method achieves state-of-the-art performance on TextVQA and ST-VQA datasets.
IEEE TRANSACTIONS ON IMAGE PROCESSING
(2023)
Article
Computer Science, Cybernetics
Zihan Wang, Olivia Byrnes, Hu Wang, Ruoxi Sun, Congbo Ma, Huaming Chen, Qi Wu, Minhui Xue
Summary: The use of deep learning techniques in data hiding has greatly advanced secure communication and identity verification fields. Digital watermarking and steganography techniques, by embedding information into noise-tolerant signals like audio, video, or images, can protect sensitive intellectual property (IP) and enable confidential communication for authorized parties. This survey provides a systematic overview of recent developments in deep learning techniques for data hiding, based on model architectures and noise injection methods. It also suggests and discusses potential future research directions that combine digital watermarking and steganography in software engineering to enhance security and mitigate risks. This contribution promotes the creation of a more trustworthy digital world and advances responsible artificial intelligence (AI).
IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS
(2023)
Article
Computer Science, Information Systems
Amin Parvaneh, Ehsan Abbasnejad, Qi Wu, Javen Qinfeng Shi, Anton van den Hengel
Summary: This study proposes a modular deep neural network called Price Negotiator to improve negotiation in online shopping. It addresses the challenges by considering item images, finding similar items, predicting price actions, and adjusting prices based on predicted actions.
IEEE TRANSACTIONS ON MULTIMEDIA
(2022)
Article
Computer Science, Information Systems
Zeren Sun, Huafeng Liu, Qiong Wang, Tianfei Zhou, Qi Wu, Zhenmin Tang
Summary: This paper proposes an end-to-end framework named Co-LDL for addressing the performance degradation of deep neural networks caused by label noise. The framework incorporates the low-loss sample selection strategy with label distribution learning and trains two deep neural networks simultaneously to communicate useful knowledge. Additionally, a self-supervised module is introduced to enhance the learned representations.
IEEE TRANSACTIONS ON MULTIMEDIA
(2022)
Article
Computer Science, Information Systems
Chuanyi Zhang, Qiong Wang, Guosen Xie, Qi Wu, Fumin Shen, Zhenmin Tang
Summary: This article introduces a method for learning fine-grained tasks from web data, which purifies noisy training sets by identifying and distinguishing noisy images, and trains models to alleviate the effects of noise.
IEEE TRANSACTIONS ON MULTIMEDIA
(2022)
Article
Computer Science, Artificial Intelligence
Hu Wang, Hao Chen, Qi Wu, Congbo Ma, Yidong Li
Summary: The control of traffic signals is crucial in relieving traffic congestion in urban areas. However, it is difficult due to the complexity of real-world traffic dynamics. To address this, the researchers propose a new dataset and a novel model based on deep reinforcement learning for optimizing multi-intersection traffic control. The experimental results show that the proposed model outperforms other methods.
IEEE OPEN JOURNAL OF INTELLIGENT TRANSPORTATION SYSTEMS
(2022)
Proceedings Paper
Computer Science, Artificial Intelligence
Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu
Summary: This paper discusses the advantages of a simple attention mechanism in OCR text-related tasks, splitting OCR features into visual and linguistic attention branches and sending them to a Transformer decoder to generate answers or captions. The baseline model performs strongly, outperforming state-of-the-art models on two popular benchmarks and surpassing the TextCaps Challenge 2020 winner.
THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE
(2021)
Proceedings Paper
Computer Science, Artificial Intelligence
Zhaokai Wang, Renda Bao, Qi Wu, Si Liu
Summary: Reading text in the visual scene is crucial to understanding key information when describing an image. This study introduces a Confidence-aware Non-repetitive Multimodal Transformer (CNMT) to read OCR tokens and generate accurate descriptions.
THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE
(2021)