Video frame synthesis is an important task in computer vision and has drawn great interests in wide applications. However, existing neural network methods do not explicitly impose tensor low-rankness of videos to capture the spatiotemporal correlations in a high-dimensional space, while existing iterative algorithms require hand-crafted parameters and take relatively long running time. In this paper, we propose a novel multi-phase deep neural network Transform-Based Tensor-Net that exploits the low-rank structure of video data in a learned transform domain, which unfolds an Iterative Shrinkage-Thresholding Algorithm (ISTA) for tensor signal recovery. Our design is based on two observations: (i) both linear and nonlinear transforms can be implemented by a neural network layer, and (ii) the soft-thresholding operator corresponds to an activation function. Further, such an unfolding design is able to achieve nearly real-time at the cost of training time and enjoys an interpretable nature as a byproduct. Experimental results on the KTH and UCF-101 datasets show that compared with the state-of-the-art methods, i.e., DVF and Super SloMo, the proposed scheme improves Peak Signal-to-Noise Ratio (PSNR) of video interpolation and prediction by 4.13 dB and 4.26 dB, respectively.