Article
Computer Science, Artificial Intelligence
Yifan Zhang, Wengang Zhou, Min Wang, Qi Tian, Houqiang Li
Summary: A Cross-modal Relation Guided Network (CRGN) is proposed for measuring the similarity between images and text sentences. The relations between image regions are modeled through global feature guiding and sentence generation, enabling efficient retrieval between images and text.
IEEE TRANSACTIONS ON IMAGE PROCESSING
(2021)
Article
Engineering, Electrical & Electronic
Keyu Wen, Xiaodong Gu, Qingrong Cheng
Summary: In this work, a novel multi-level semantic relations enhancement approach named DSRAN is proposed to address the issue of mismatch between regional features and global features in image-text matching. DSRAN consists of two modules, performing graph attention for region-level relations enhancement and regional-global relations enhancement simultaneously. The experimental results show that DSRAN outperforms previous approaches by a large margin, demonstrating the effectiveness of the dual semantic relations learning scheme.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
(2021)
Article
Computer Science, Artificial Intelligence
Jiangtong Li, Liu Liu, Li Niu, Liqing Zhang
Summary: The MEMBER method introduces global memory banks to enable fine-grained alignment and fusion between images and texts in the embedding learning paradigm, achieving mutual embedding enhancement while maintaining retrieval efficiency. Extensive experiments show that MEMBER outperforms state-of-the-art approaches on two large-scale benchmark datasets.
IEEE TRANSACTIONS ON IMAGE PROCESSING
(2021)
Article
Engineering, Electrical & Electronic
Yang Liu, Hong Liu, Huaqiu Wang, Mengyuan Liu
Summary: This paper presents a contrastive visual semantic embedding framework to address the problem of semantic misalignment in image-text matching. It achieves intra-modal and inter-modal semantic alignment through contrastive learning and reports state-of-the-art results on large-scale datasets.
IEEE SIGNAL PROCESSING LETTERS
(2022)
Article
Computer Science, Artificial Intelligence
Yang Liu, Hong Liu, Huaqiu Wang, Fanyang Meng, Mengyuan Liu
Summary: This article proposes a bidirectional correct attention network (BCAN) to solve the semantic misalignment problem in cross-modal retrieval. It introduces a relevance concept between subfragments and the semantics of the entire images or sentences, and designs a correct attention mechanism that models local and global similarity.
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
(2023)
Article
Computer Science, Theory & Methods
Osman Tursun, Simon Denman, Sabesan Sivapalan, Sridha Sridharan, Clinton Fookes, Sandra Mau
Summary: This paper shows that the ranking accuracy of trademark retrieval systems can be significantly improved by incorporating hard and soft attention mechanisms, which has practical significance.
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY
(2022)
Article
Computer Science, Artificial Intelligence
Ya Jing, Wei Wang, Liang Wang, Tieniu Tan
Summary: This paper introduces a model called Graph Attentive Relational Network (GARN) to learn aligned image-text representations by modeling the relationships between noun phrases in texts. The model achieves state-of-the-art results on multiple benchmark datasets.
IEEE TRANSACTIONS ON IMAGE PROCESSING
(2021)
Article
Engineering, Electrical & Electronic
Hong Lan, Pufen Zhang
Summary: This study proposes a novel Multi-Level Matching Network (MLMN) for measuring the similarity between images and texts. By learning and integrating vector-based multi-level matching features, the proposed method enhances the performance of image-text retrieval and improves interpretability.
IEEE SIGNAL PROCESSING LETTERS
(2022)
Article
Computer Science, Software Engineering
Johannes Knittel, Steffen Koch, Thomas Ertl
Summary: PyramidTags is a new approach to visually summarizing large text collections, incorporating both temporal evolution and semantic relationship of visualized tags. It provides analysts with a starting point for interactive exploration to grasp important ideas and stories.
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS
(2021)
Article
Computer Science, Information Systems
Yaxiong Wang, Hao Yang, Xiuxiu Bai, Xueming Qian, Lin Ma, Jing Lu, Biao Li, Xin Fan
Summary: This paper introduces a novel position focusing attention network to investigate the relation between the visual and textual views, enhancing joint-embedding learning by integrating object positions and a position attention mechanism. Experiments on the Flickr30K, MS-COCO, and Tencent-News datasets show competitive performance.
IEEE TRANSACTIONS ON MULTIMEDIA
(2021)
Article
Engineering, Electrical & Electronic
Yan Wang, Yuting Su, Wenhui Li, Jun Xiao, Xuanya Li, An-An Liu
Summary: In this paper, the authors propose a novel Dual-path Rare Content Enhancement Network (DRCE) to tackle the long-tail problem in image-text matching. They introduce Cross-modal Representation Enhancement (CRE) and Cross-modal Association Enhancement (CAE) to construct a dual-path structure that enhances the representation and association of rare content using cross-modal prior knowledge. The authors also propose an Adaptive Fusion Strategy (AFS) to effectively fuse complementary cross-modal relations and an alternative re-ranking strategy (ARR) to refine matching results using reciprocal contextual information.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
(2023)
Article
Computer Science, Information Systems
Huaxin Pang, Shikui Wei, Gangjian Zhang, Shiyin Zhang, Shuang Qiu, Yao Zhao
Summary: Composed Image Retrieval (CIR) is a method that combines a reference image and text feedback to search for specific images. It offers a more comprehensive understanding of users' search intent and improves the accuracy of target image retrieval, making it crucial for various real-world applications such as E-commerce and Internet search. However, due to the existing semantic gap between image and text, it is challenging to achieve a synthetic understanding and fusion of both modalities. In this work, an end-to-end framework called MCR is proposed to address this challenge, utilizing both text and images as retrieval queries. The framework consists of four pivotal modules that effectively bridge the semantic gap and learn complementary representations for composed queries, achieving superior performance compared to state-of-the-art algorithms in various benchmark tests.
IEEE TRANSACTIONS ON MULTIMEDIA
(2023)
Article
Computer Science, Artificial Intelligence
Shu-Juan Peng, Yi He, Xin Liu, Yiu-ming Cheung, Xing Xu, Zhen Cui
Summary: Fine-grained image-text retrieval is a hot research topic that aims to bridge the gap between vision and language. The main challenge lies in learning the semantic correspondence across different modalities. Existing methods mainly focus on learning global semantic correspondence or intra-modal relation correspondence, and often overlook the importance of intermodal relations. To address this issue, the authors propose a relation-aggregated cross-graph (RACG) model that explicitly learns fine-grained semantic correspondence by aggregating both intra-modal and intermodal relations.
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
(2022)
Article
Engineering, Electrical & Electronic
Zejun Liu, Fanglin Chen, Jun Xu, Wenjie Pei, Guangming Lu
Summary: Cross-modal image-text retrieval is an important task in Vision-and-Language, which aligns image-text pairs by embedding features into a shared space. Current approaches use weighted combinations for inter-modal alignment and intra-modal relationship modeling. However, the same item contributes differently in these processes, leading to semantic changes and misalignment. To address this, this paper introduces Cross-modal Semantic Importance Consistency (CSIC), which achieves semantic invariance by measuring importance and improving representation through inter-calibration. Experiments on the Flickr30K and MS COCO datasets demonstrate the superiority and rationality of the proposed approach.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
(2023)
Article
Computer Science, Artificial Intelligence
Guoshuai Zhao, Chaofeng Zhang, Heng Shang, Yaxiong Wang, Li Zhu, Xueming Qian
Summary: Despite extensive research on bidirectional image-text matching, the task remains challenging because of the semantic gap between visual and textual modalities. Most existing methods focus only on visual object features and ignore the semantic attributes of detected regions. To address this issue, the authors propose a generative multiattribution tag fusion method with region attribution, which effectively bridges the semantic gap.
KNOWLEDGE-BASED SYSTEMS
(2023)