Multi-view mouth rendering for assisting lip-reading
Abstract
Previous work has shown that people who rely on lip-reading generally prefer a frontal view of their interlocutor, although a profile view can make certain lip gestures more noticeable. This work presents an assistive tool that receives an unconstrained video of a speaker, captured from an arbitrary viewpoint, and not only locates the mouth region but also displays augmented versions of the lips in both frontal and profile views. This is achieved using deep Generative Adversarial Networks (GANs) trained on pairs of images: each training pair contains a mouth image captured at a random angle and the corresponding image (i.e., showing the same mouth shape, person, and lighting conditions) captured at a fixed view. At test time, the networks receive an unseen mouth image taken at an arbitrary angle and map it to the fixed frontal and profile views. Because building a large-scale pairwise dataset is time-consuming, we train on realistic synthetic 3D models and test on videos of real subjects. Our approach is speaker-independent and language-independent, and our results show that the GANs produce visually compelling output that may assist people with hearing impairments.
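To make the paired training setup concrete, the following is a minimal sketch (not the authors' implementation) of a conditional, paired image-to-image GAN training step in PyTorch. It assumes (arbitrary-view, fixed-view) mouth crops resized to 64x64 RGB; the network sizes, losses, and the L1 weight are illustrative placeholders, and the random tensors stand in for a real data loader over rendered 3D-model image pairs.

# Minimal paired image-to-image GAN sketch (illustrative only).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps an arbitrary-view mouth crop to a fixed (frontal or profile) view."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Scores (input, output) pairs as real or generated (conditional GAN)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),
        )
    def forward(self, src, tgt):
        return self.net(torch.cat([src, tgt], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

# One training step on a batch of paired crops (random tensors as placeholders).
src = torch.rand(8, 3, 64, 64) * 2 - 1   # arbitrary-view mouth crops
tgt = torch.rand(8, 3, 64, 64) * 2 - 1   # corresponding fixed-view crops

# Discriminator update: distinguish real pairs from generated pairs.
fake = G(src).detach()
d_real, d_fake = D(src, tgt), D(src, fake)
loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator update: fool the discriminator while staying close to the target (L1).
fake = G(src)
d_out = D(src, fake)
loss_g = bce(d_out, torch.ones_like(d_out)) + 100.0 * l1(fake, tgt)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

In a full pipeline, one such generator would be trained per target view (frontal and profile), with the paired crops produced by rendering the synthetic 3D head models at random and fixed angles.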