Learning Temporal Dynamics for Video Super-Resolution: A Deep Learning Approach

Ding Liu; Zhaowen Wang; Yuchen Fan; Xianming Liu; Zhangyang Wang; Shiyu Chang; Xinchao Wang; Thomas S. Huang

doi:10.1109/TIP.2018.2820807

IEEE TIP

Paper

01 Jul 2018

Learning Temporal Dynamics for Video Super-Resolution: A Deep Learning Approach

View publication

Abstract

Video super-resolution (SR) aims at estimating a high-resolution video sequence from a low-resolution (LR) one. Given that the deep learning has been successfully applied to the task of single image SR, which demonstrates the strong capability of neural networks for modeling spatial relation within one single image, the key challenge to conduct video SR is how to efficiently and effectively exploit the temporal dependence among consecutive LR frames other than the spatial relation. However, this remains challenging because the complex motion is difficult to model and can bring detrimental effects if not handled properly. We tackle the problem of learning temporal dynamics from two aspects. First, we propose a temporal adaptive neural network that can adaptively determine the optimal scale of temporal dependence. Inspired by the inception module in GoogLeNet [1], filters of various temporal scales are applied to the input LR sequence before their responses are adaptively aggregated, in order to fully exploit the temporal relation among the consecutive LR frames. Second, we decrease the complexity of motion among neighboring frames using a spatial alignment network that can be end-to-end trained with the temporal adaptive network and has the merit of increasing the robustness to complex motion and the efficiency compared with the competing image alignment methods. We provide a comprehensive evaluation of the temporal adaptation and the spatial alignment modules. We show that the temporal adaptive design considerably improves the SR quality over its plain counterparts, and the spatial alignment network is able to attain comparable SR performance with the sophisticated optical flow-based approach, but requires a much less running time. Overall, our proposed model with learned temporal dynamics is shown to achieve the state-of-the-art SR results in terms of not only spatial consistency but also the temporal coherence on public video data sets. More information can be found in http://www.ifp.illinois.edu/~dingliu2/videoSR/.

Conference paper