Journal
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
Volume 32, Issue 8, Pages 5680-5694
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCSVT.2022.3150959
Keywords
Streaming media; Representation learning; Feature extraction; Visualization; Task analysis; Electronic mail; Aggregates; Cross-modal retrieval; video-text retrieval; video representation learning; preview-aware attention
Funding
- National Key Research and Development Program of China [2018YFB1404102]
- NSFC [62172420, 61902347, 61976188]
- Public Welfare Technology Research Project of Zhejiang Province [LGF21F020010]
- Research Program of Zhejiang Laboratory [2019KD0AC02]
- Open Projects Program of National Laboratory of Pattern Recognition
- Fundamental Research Funds for the Provincial Universities of Zhejiang
This paper presents a Reading-strategy Inspired Visual Representation Learning (RIVRL) method for text-to-video retrieval. It combines a previewing branch, which captures overview information of a video, with a preview-aware intensive-reading branch that extracts fine-grained features. Experimental results show that the proposed model achieves state-of-the-art performance on multiple datasets.
This paper addresses the task of text-to-video retrieval: given a query in the form of a natural-language sentence, the goal is to retrieve, from a large collection of unlabeled videos, those that are semantically relevant to the query. The success of this task depends on cross-modal representation learning that projects both videos and sentences into common spaces for semantic similarity computation. In this work, we concentrate on video representation learning, an essential component for text-to-video retrieval. Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) method to represent videos, which consists of two branches: a previewing branch and an intensive-reading branch. The previewing branch is designed to briefly capture the overview information of videos, while the intensive-reading branch is designed to obtain more in-depth information. Moreover, the intensive-reading branch is aware of the video overview captured by the previewing branch. Such holistic information is found to be useful for the intensive-reading branch to extract more fine-grained features. Extensive experiments on three datasets are conducted, where our model RIVRL achieves a new state-of-the-art on TGIF and VATEX. Moreover, on MSR-VTT, our model using two video features shows performance comparable to the state-of-the-art using seven video features, and even outperforms models pre-trained on the large-scale HowTo100M dataset. Code is available at https://github.com/LiJiaBei-7/rivrl.
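To make the two-branch idea concrete, the following is an illustrative NumPy sketch, not the authors' implementation (see the linked repository for that). Here the previewing branch is assumed to be simple mean pooling over frame features, and "preview-aware" attention is approximated by using the preview vector to form the attention query over frames; the function names, projection matrices, and dimensions are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def preview_branch(frames):
    """Previewing branch (assumed): a coarse video overview via mean pooling."""
    return frames.mean(axis=0)  # (d,)

def intensive_branch(frames, preview, W_q, W_k, W_v):
    """Intensive-reading branch (assumed): frame-level attention whose query
    is conditioned on the preview vector, a simplified preview-aware attention."""
    q = preview @ W_q                            # (d,) query from the overview
    k = frames @ W_k                             # (T, d) per-frame keys
    v = frames @ W_v                             # (T, d) per-frame values
    weights = softmax(k @ q / np.sqrt(len(q)))   # (T,) attention over frames
    return weights @ v                           # (d,) fine-grained summary

rng = np.random.default_rng(0)
T, d = 8, 16                                     # 8 frames, 16-dim features
frames = rng.standard_normal((T, d))             # stand-in for CNN frame features
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

preview = preview_branch(frames)
fine = intensive_branch(frames, preview, W_q, W_k, W_v)
video_repr = np.concatenate([preview, fine])     # combined video representation
print(video_repr.shape)
```

In this sketch the final video representation concatenates the coarse and fine vectors; in a full retrieval model, both it and the sentence representation would be projected into a common embedding space for similarity computation.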