Article

Predicting Diverse Future Frames With Local Transformation-Guided Masking

Publisher

Institute of Electrical and Electronics Engineers (IEEE)
DOI: 10.1109/TCSVT.2018.2882061

Keywords

Predictive models; Generators; Task analysis; Visualization; Computational modeling; Complexity theory; Training; Video prediction; diverse future frames; local transformation level; transformation-guided masking; region of interest; video prediction on single frame

Funding

  1. Shenzhen Peacock Plan [20130408-183003656]
  2. Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality [ZDSYS-201703031405467]
  3. National Natural Science Foundation of China [U-1613209]

Abstract

Video prediction is the challenging task of generating the future frames of a video given a sequence of previously observed frames. This task involves the construction of an internal representation that accurately models the frame evolutions, including contents and dynamics. Video prediction is considered difficult due to the inherent compounding of errors in recursive pixel-level prediction. In this paper, we present a novel video prediction system that focuses on regions of interest (ROIs) rather than on entire frames and learns frame evolutions at the transformation level rather than at the pixel level. We provide two strategies to generate high-quality ROIs that contain potential moving visual cues. The frame evolutions are modeled with a transformation generator that produces transformers and masks simultaneously, which are then combined to generate the future frame in a transformation-guided masking procedure. Compared with recent approaches, our system generates more accurate predictions by modeling the visual evolutions at the transformation level rather than at the pixel level. Focusing on ROIs avoids a heavy computational burden and enables our system to generate high-quality long-term future frames without severely amplified signal loss. Moreover, our system is able to generate diverse plausible future frames, which is important in many real-world scenarios. Furthermore, we enable our system to perform video prediction conditioned on a single frame by revising the transformation generator to produce motion-centric transformers. We test our system on four datasets with different experimental settings and demonstrate its advantages over recent methods, both quantitatively and qualitatively.
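The core combination step described in the abstract can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical illustration and not the authors' implementation: it assumes the generator has produced K transformed candidate versions of the previous frame (here faked with simple horizontal shifts as stand-ins for the predicted local transformations) and K per-pixel mask logits (here random stand-ins); transformation-guided masking then forms the predicted frame as the mask-weighted sum of the candidates.

```python
# Hypothetical sketch of transformation-guided masking (not the paper's code).
# Assumptions: K candidate frames, each the result of applying one predicted
# local transformation to the previous frame, and K per-pixel masks that are
# softmax-normalized so the weights at each pixel sum to 1 across candidates.
import torch
import torch.nn.functional as F

K, B, C, H, W = 4, 1, 3, 64, 64

prev = torch.rand(B, C, H, W)  # previously observed frame

# Stand-in for the K transformed versions of `prev` (real systems would
# apply predicted warps/kernels; we use horizontal shifts for illustration):
candidates = torch.stack(
    [prev.roll(shifts=k, dims=-1) for k in range(K)], dim=1
)  # shape (B, K, C, H, W)

# Stand-in for the generator's raw mask logits, one map per transformation:
mask_logits = torch.randn(B, K, H, W)
masks = F.softmax(mask_logits, dim=1)  # per-pixel weights summing to 1 over K

# Transformation-guided masking: combine candidates with their masks.
pred = (masks.unsqueeze(2) * candidates).sum(dim=1)  # shape (B, C, H, W)
print(pred.shape)  # torch.Size([1, 3, 64, 64])
```

Because the softmax makes the per-pixel weights a convex combination, the predicted pixel values stay within the range of the transformed candidates, which suggests one reason transformation-level prediction tends to avoid the blur that direct pixel regression can produce.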
