Article

Learning for Video Compression With Recurrent Auto-Encoder and Recurrent Probability Model

Journal

Publisher

IEEE - Institute of Electrical and Electronics Engineers Inc.
DOI: 10.1109/JSTSP.2020.3043590

Keywords

Video compression; Image coding; Decoding; Entropy; Correlation; Motion compensation; Estimation; Deep learning; Recurrent neural network

Funding

  1. ETH Zurich General Fund (OK)
  2. Amazon AWS Grant


This paper introduces a Recurrent Learned Video Compression (RLVC) approach with Recurrent Auto-Encoder (RAE) and Recurrent Probability Model (RPM) to better utilize the temporal correlation among video frames, achieving state-of-the-art compression performance.
The past few years have witnessed increasing interest in applying deep learning to video compression. However, the existing approaches compress each video frame with only a small number of reference frames, which limits their ability to fully exploit the temporal correlation among video frames. To overcome this shortcoming, this paper proposes a Recurrent Learned Video Compression (RLVC) approach with a Recurrent Auto-Encoder (RAE) and a Recurrent Probability Model (RPM). Specifically, the RAE employs recurrent cells in both the encoder and the decoder, so that temporal information from a large range of frames can be used both for generating the latent representations and for reconstructing the compressed outputs. Furthermore, the proposed RPM network recurrently estimates the Probability Mass Function (PMF) of the current latent representation, conditioned on the distribution of the previous latent representations. Because consecutive frames are correlated, this conditional cross entropy can be lower than the independent cross entropy, thus reducing the bit-rate. The experiments show that our approach achieves state-of-the-art learned video compression performance in terms of both PSNR and MS-SSIM. Moreover, our approach outperforms the default Low-Delay P (LDP) setting of x265 on PSNR, and also achieves better MS-SSIM than the SSIM-tuned x265 and the slowest setting of x265. The code is available at https://github.com/RenYang-home/RLVC.git.
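The bit-rate saving claimed for the RPM rests on a basic information-theoretic fact: when consecutive latents are correlated, coding each symbol under a PMF conditioned on its predecessor costs fewer bits than coding it under the marginal PMF. The sketch below is not the RLVC implementation; it uses a scalar binary Markov stream as a stand-in for the high-dimensional quantized latents, purely to make the conditional-vs-independent cross-entropy gap concrete:

```python
import math
import random

random.seed(0)

# Toy latent stream: a binary Markov chain with strong temporal
# correlation, standing in for the quantized latents of consecutive frames.
p_stay = 0.9  # probability that the next symbol repeats the previous one
stream = [0]
for _ in range(100_000):
    prev = stream[-1]
    stream.append(prev if random.random() < p_stay else 1 - prev)

# Independent model: code every symbol with its marginal PMF.
p1 = sum(stream) / len(stream)
marginal = {0: 1 - p1, 1: p1}
indep_bits = -sum(math.log2(marginal[s]) for s in stream) / len(stream)

# Conditional model (the RPM idea): code each symbol with a PMF
# conditioned on the previously decoded symbol.
cond = {0: {0: p_stay, 1: 1 - p_stay}, 1: {0: 1 - p_stay, 1: p_stay}}
cond_bits = -sum(math.log2(cond[prev][cur])
                 for prev, cur in zip(stream, stream[1:])) / (len(stream) - 1)

print(f"independent cross entropy: {indep_bits:.3f} bits/symbol")
print(f"conditional cross entropy: {cond_bits:.3f} bits/symbol")
```

With `p_stay = 0.9` the marginal distribution is roughly uniform, so the independent model pays about 1 bit per symbol, while the conditional model pays approximately the chain's entropy rate of about 0.47 bits per symbol; an entropy coder driven by the conditional PMF would realize that saving.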



Recommended

Article Computer Science, Artificial Intelligence

Binaural SoundNet: Predicting Semantics, Depth and Motion With Binaural Sounds

Dengxin Dai, Arun Balajee Vasudevan, Jiri Matas, Luc Van Gool

Summary: This work develops an approach for scene understanding based purely on binaural sounds, which can predict the semantic masks, motion, and depth of sound-making objects. By leveraging cross-modal distillation and spatial sound super-resolution, the performance of the auditory perception tasks is significantly improved. Experimental results show good performance on all tasks, mutual benefits among the tasks, and the importance of the number and orientation of microphones.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Article Engineering, Electrical & Electronic

Advancing Learned Video Compression With In-Loop Frame Prediction

Ren Yang, Radu Timofte, Luc Van Gool

Summary: In recent years, there has been an increasing interest in end-to-end learned video compression. Previous works focused on compressing motion maps to exploit temporal redundancy. However, they did not fully utilize historical information in sequential reference frames. This paper proposes an Advanced Learned Video Compression (ALVC) approach with an in-loop frame prediction module, which effectively predicts the target frame from previously compressed frames. The experiments demonstrate the state-of-the-art performance of ALVC in learned video compression.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2023)

Article Computer Science, Artificial Intelligence

A Survey on Deep Learning Technique for Video Segmentation

Tianfei Zhou, Fatih Porikli, David J. Crandall, Luc Van Gool, Wenguan Wang

Summary: Video segmentation is crucial in various practical applications such as enhancing visual effects in movies, understanding scenes in autonomous driving, and creating virtual backgrounds in video conferencing. Deep learning-based approaches have shown promising performance in video segmentation. This survey comprehensively reviews two main research lines, generic object segmentation and video semantic segmentation, by introducing their task settings, background concepts, perceived need, development history, and challenges. Representative literature and datasets are also discussed, and the reviewed methods are benchmarked on well-known datasets. Open issues and opportunities for further research are identified, and a public website is provided to track developments in this field: https://github.com/tfzhou/VS-Survey.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Article Engineering, Electrical & Electronic

MASIC: Deep Mask Stereo Image Compression

Xin Deng, Yufan Deng, Ren Yang, Wenzhe Yang, Radu Timofte, Mai Xu

Summary: In this paper, a novel network model called MASIC is proposed for stereo image compression. It achieves higher compression efficiency and quality through the introduction of a mask prediction module and a mask-conditional stereo entropy model.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2023)

Article Computer Science, Artificial Intelligence

PDC-Net+: Enhanced Probabilistic Dense Correspondence Network

Prune Truong, Martin Danelljan, Radu Timofte, Luc Van Gool

Summary: Establishing accurate correspondences between images is important in computer vision applications, and dense approaches offer an alternative to sparse methods. However, dense flow estimation is often inaccurate in certain cases. In this paper, we propose a network, PDC-Net+, that can estimate accurate dense correspondences along with a reliable confidence map. Our approach learns the flow prediction and uncertainty estimation using a probabilistic approach, achieving state-of-the-art results on various challenging datasets.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Article Computer Science, Artificial Intelligence

Global Aligned Structured Sparsity Learning for Efficient Image Super-Resolution

Huan Wang, Yulun Zhang, Can Qin, Luc Van Gool, Yun Fu

Summary: This article presents a method called Global Aligned Structured Sparsity Learning (GASSL) to tackle the problem of efficient image super-resolution (SR). The method includes two major components: Hessian-Aided Regularization (HAIR) and Aligned Structured Sparsity Learning (ASSL). GASSL outperforms other recent methods in terms of efficiency, as demonstrated by extensive results.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)

Article Automation & Control Systems

Practical Blind Image Denoising via Swin-Conv-UNet and Data Synthesis

Kai Zhang, Yawei Li, Jingyun Liang, Jiezhang Cao, Yulun Zhang, Hao Tang, Deng-Ping Fan, Radu Timofte, Luc Van Gool

Summary: This paper proposes a new approach for image denoising by focusing on network architecture design and training data synthesis. The use of a swin-conv block in the image-to-image translation UNet architecture significantly improves denoising performance. Additionally, a practical noise degradation model is designed to handle various types of noise and resizing, leading to improved practicality. Experimental results demonstrate the effectiveness of the proposed methods.

MACHINE INTELLIGENCE RESEARCH (2023)

Proceedings Paper Computer Science, Artificial Intelligence

Audio-Visual Efficient Conformer for Robust Speech Recognition

Maxime Burchi, Radu Timofte

Summary: This paper proposes an improved Efficient Conformer CTC architecture to address the decline in performance of automatic speech recognition systems in noisy speech. The authors use audio and visual modalities to enhance noise robustness and accelerate training. Experimental results on LRS2 and LRS3 datasets show that the proposed approach achieves state-of-the-art performance with lower WER and faster training.

2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) (2023)

Proceedings Paper Computer Science, Artificial Intelligence

Perceptual Image Enhancement for Smartphone Real-Time Applications

Marcos V. Conde, Florin Vasluianu, Javier Vazquez-Corral, Radu Timofte

Summary: Recent advances in camera designs and imaging pipelines allow us to capture high-quality images using smartphones. However, the limitations of smartphone cameras often result in image artifacts and degradation. Deep learning methods for image restoration can effectively remove these artifacts, but are often not suitable for real-time applications on mobile devices. This paper proposes a lightweight network, LPIENet, for perceptual image enhancement on smartphones. Experimental results demonstrate that the model can handle artifacts and achieve competitive performance with fewer parameters and operations. Furthermore, the model was deployed directly on commercial smartphones and processed 2K-resolution images in under 1 second.

2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) (2023)

Proceedings Paper Computer Science, Artificial Intelligence

Fast Online Video Super-Resolution with Deformable Attention Pyramid

Dario Fuoli, Martin Danelljan, Radu Timofte, Luc Van Gool

Summary: In this work, we propose a recurrent VSR architecture based on a deformable attention pyramid (DAP) to address the VSR problem with strict causal, real-time, and latency constraints. Our DAP aligns and integrates information from the recurrent state into the current frame prediction, overcoming the challenge of unavailable future frame information. By attending to a limited number of spatial locations dynamically predicted by the DAP, we reduce computational cost compared to traditional attention-based methods. Experimental results show the effectiveness of our approach, achieving a significant speed-up and surpassing state-of-the-art methods on benchmark tests.

2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) (2023)

Article Computer Science, Artificial Intelligence

Towards a Robust Deep Neural Network Against Adversarial Texts: A Survey

Wenqi Wang, Run Wang, Lina Wang, Zhibo Wang, Aoshuang Ye

Summary: Deep neural networks have achieved remarkable success in various tasks, but they are vulnerable to adversarial examples in both image and text domains. Adversarial examples in the text domain can evade DNN-based text analyzers and pose threats to the spread of disinformation. This paper comprehensively surveys the existing studies on adversarial techniques for generating adversarial texts and the corresponding defense methods, aiming to inspire future research in developing robust DNN-based text analyzers against known and unknown adversarial techniques.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2023)

Article Automation & Control Systems

Masked Vision-language Transformer in Fashion

Ge-Peng Ji, Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Christos Sakaridis, Luc Van Gool

Summary: This paper presents a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation, which replaces the bidirectional encoder representations from Transformers (BERT) with the vision transformer architecture. It is the first end-to-end framework for the fashion domain and includes masked image reconstruction (MIR) for fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that can handle raw multi-modal inputs without extra pre-processing models and shows improvements in retrieval and recognition tasks compared to Kaleido-BERT, the Fashion-Gen 2018 winner.

MACHINE INTELLIGENCE RESEARCH (2023)

Article Computer Science, Artificial Intelligence

EATN: An Efficient Adaptive Transfer Network for Aspect-Level Sentiment Analysis

Kai Zhang, Qi Liu, Hao Qian, Biao Xiang, Qing Cui, Jun Zhou, Enhong Chen

Summary: This paper proposes a novel model called EATN for accurately classifying sentiment polarities towards aspects across multiple domains in sentiment analysis tasks. The model incorporates a Domain Adaptation Module (DAM) to learn common features and uses a multiple-kernel selection method to reduce feature discrepancy among domains. Additionally, EATN includes an aspect-oriented multi-head attention mechanism to capture the direct associations between aspects and contextual sentiment words. Extensive experiments on six public datasets demonstrate the effectiveness and universality of the proposed method compared to current state-of-the-art methods.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2023)
