Facial image-to-video translation by a hidden affine transformation
Work on video prediction, which aims to extrapolate future video frames from past ones, has grown rapidly. Existing temporal-based methods are limited to a certain number of frames. In this paper, we study video prediction from a single still image in the facial expression domain, a.k.a. facial image-to-video translation. Our main approach, dubbed AffineGAN, associates each facial image with an expression intensity and leverages an affine transformation in the latent space. AffineGAN allows users to control the number of frames to predict as well as the expression intensity of each frame. Unlike previous intensity-based methods, we derive an inverse formulation of the affine transformation, enabling automatic inference of facial expression intensities from videos. This avoids manual annotation, which is not only tedious but also ambiguous, since people express themselves in various ways and have different opinions about the intensity of a facial image. Both quantitative and qualitative results verify the superiority of AffineGAN over state-of-the-art methods. Notably, in a Turing test with web faces, more than 50% of the facial expression videos generated by AffineGAN are considered real by Amazon Mechanical Turk workers. This work could improve users' communication experience by enabling them to conveniently and creatively produce expression GIFs, which are a popular art form in online messaging and social networks.
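As a rough illustration of the idea, the sketch below treats the latent code of a frame as a neutral code shifted along a single expression direction, scaled by a scalar intensity, and recovers that intensity by least-squares projection. This is a toy model under stated assumptions, not the AffineGAN architecture: the abstract does not give the exact form of the transformation or of its inverse, and the names `affine_latent`, `infer_intensity`, and `intensity_schedule`, as well as the three-dimensional vectors, are hypothetical.

```python
# Hypothetical sketch: every name and value here is an illustrative
# assumption, not the AffineGAN architecture or its API.

def affine_latent(base, direction, intensity):
    """Shift a latent code along a direction: base + intensity * direction."""
    return [b + intensity * d for b, d in zip(base, direction)]

def infer_intensity(latent, base, direction):
    """Least-squares inverse: project (latent - base) onto direction."""
    num = sum((z - b) * d for z, b, d in zip(latent, base, direction))
    den = sum(d * d for d in direction)
    if den == 0:
        raise ValueError("direction must be non-zero")
    return num / den

def intensity_schedule(num_frames, peak=1.0):
    """Linearly ramp the intensity from 0 to peak over num_frames frames."""
    if num_frames < 2:
        return [peak]
    return [peak * i / (num_frames - 1) for i in range(num_frames)]

base = [0.5, -1.0, 2.0]       # stand-in for an encoded neutral face
direction = [1.0, 0.0, -2.0]  # stand-in for an expression direction

# One latent code per requested frame, each with its own intensity.
frames = [affine_latent(base, direction, a) for a in intensity_schedule(5)]

# Recover the intensity of each frame from its latent code alone.
recovered = [infer_intensity(z, base, direction) for z in frames]
```

In this toy setting the number of frames and the intensity of each frame are free parameters chosen by the caller, and the projection returns the intensities that produced the codes. In a real system a learned decoder would map each latent code to an image, a step the sketch omits.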