Dual-Level Representation Enhancement on Characteristic and Context for Image-Text Retrieval

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Volume 32, Issue 11, Pages 8037-8050

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TCSVT.2022.3182426

Keywords

Semantics; Visualization; Feature extraction; Correlation; Learning systems; Task analysis; Filtration; Dual-level feature enhancement; multi-block matching; image-text retrieval

Funding

National Natural Science Foundation of China [U21B2024]
National Key Research and Development Program of China [2021YFF0704003]
Baidu Program

Automated Summary

This paper proposes a dual-level representation enhancement network (DREN) to improve image-text retrieval. By exploring characteristics and contexts of regions and words in a joint manner, accurate matching of image-text pairs is achieved, leading to superior retrieval performance.

Abstract

Image-text retrieval is a fundamental and vital task in multimedia retrieval and has received growing attention because it connects heterogeneous data. Previous methods that perform well on image-text retrieval mainly focus on the interaction between image regions and text words. However, these approaches lack joint exploration of the characteristics and contexts of regions and words, which causes semantic confusion between similar objects and a loss of contextual understanding. To address these issues, a dual-level representation enhancement network (DREN) is proposed to strengthen the characteristic and contextual representations through block-level and instance-level representation enhancement modules, respectively. The block-level module focuses on mining the potential relations between multiple blocks within each instance representation, while the instance-level module concentrates on learning the contextual relations between different instances. To facilitate the accurate matching of image-text pairs, we propose graph correlation inference and weighted adaptive filtering to conduct local and global matching between image-text pairs. Extensive experiments on two challenging datasets (i.e., Flickr30K and MSCOCO) verify the superiority of our method for image-text retrieval.
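The two enhancement levels described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how block relations or instance relations are computed, so the sketch assumes plain residual dot-product self-attention with no learned weights, assumes that an instance vector is split into equal-sized blocks, and assumes that the block-level step runs before the instance-level step. All function names (`self_attend`, `block_level_enhance`, `instance_level_enhance`, `dual_level_enhance`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attend(tokens):
    """Residual scaled dot-product self-attention over equal-length vectors.

    Assumption: no learned projections; this only stands in for whatever
    relation modeling the paper actually uses.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)]
        out.append([x + c for x, c in zip(q, context)])
    return out


def block_level_enhance(instance, num_blocks):
    """Split ONE instance vector into blocks and relate the blocks to each other."""
    dim = len(instance)
    if dim % num_blocks != 0:
        raise ValueError("feature dimension must be divisible by num_blocks")
    size = dim // num_blocks
    blocks = [instance[i * size:(i + 1) * size] for i in range(num_blocks)]
    enhanced = self_attend(blocks)
    # Concatenate the enhanced blocks back into a single vector.
    return [x for block in enhanced for x in block]


def instance_level_enhance(instances):
    """Relate different instances (regions or words) to each other for context."""
    return self_attend(instances)


def dual_level_enhance(instances, num_blocks=4):
    """Block-level step per instance, then instance-level step (assumed order)."""
    per_instance = [block_level_enhance(v, num_blocks) for v in instances]
    return instance_level_enhance(per_instance)


# Toy data: 5 "region" features of dimension 8 (real features are much larger).
random.seed(0)
regions = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
enhanced_regions = dual_level_enhance(regions, num_blocks=4)
```

The same `dual_level_enhance` call would apply to a list of word features. In this sketch the block-level step never mixes information across instances, while the instance-level step does, which mirrors the characteristic-versus-context split described in the abstract.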

Deep Relation Embedding for Cross-Modal Retrieval

Yifan Zhang, Wengang Zhou, Min Wang, Qi Tian, Houqiang Li

Summary: Cross-modal retrieval is achieved through a Cross-modal Relation Guided Network (CRGN) for measuring the similarity between images and text sentences. By learning global feature guiding and sentence generation, the relation between image regions is modeled, leading to efficient retrieval between image and text.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2021)