Publication
ICIP 2018
Conference paper
Towards View-Independent Viseme Recognition Based on CNNs and Synthetic Data
Abstract
Visual Speech Recognition is the ability to interpret spoken text using video information only. To address this task automatically, recent works have employed Deep Learning and obtained high accuracy on the recognition of words and sentences uttered in controlled environments with limited head-pose variation. However, the accuracy drops for multi-view datasets, and when it comes to interpreting isolated mouth shapes, such as visemes, the reported values are considerably lower, as shorter segments of speech lack temporal and contextual information. In this work, we evaluate the applicability of synthetic datasets for assisting viseme recognition in real-world data acquired under controlled and uncontrolled environments, using the GRID and AVICAR datasets, respectively. We create two large-scale synthetic 2D datasets based on realistic 3D facial models, one with near-frontal and one with multi-view mouth images. Our experiments indicate that a transfer learning approach using synthetic data achieves higher accuracy than training from scratch on real data only, in both scenarios.
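To make the transfer-learning idea in the abstract concrete, the following is a minimal sketch in PyTorch: a CNN classifier is first trained on synthetic mouth images and then fine-tuned on real data. The network architecture, the 14-class viseme label set, the learning rates, and the random-tensor placeholders for the synthetic and real datasets are all illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical number of viseme classes; the paper's exact label set is not given here.
NUM_VISEMES = 14

def make_cnn(num_classes: int) -> nn.Module:
    # A small CNN for mouth-region images (an assumed architecture, not the paper's).
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_classes),
    )

def train(model: nn.Module, loader, epochs: int, lr: float) -> None:
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()

def placeholder_loader(n: int = 64, batch: int = 16):
    # Random tensors stand in for the synthetic (rendered 3D faces) and
    # real (GRID / AVICAR mouth crops) datasets.
    x = torch.randn(n, 3, 64, 64)
    y = torch.randint(0, NUM_VISEMES, (n,))
    ds = torch.utils.data.TensorDataset(x, y)
    return torch.utils.data.DataLoader(ds, batch_size=batch, shuffle=True)

model = make_cnn(NUM_VISEMES)

# Step 1: pre-train on the large synthetic viseme dataset.
train(model, placeholder_loader(), epochs=1, lr=1e-3)

# Step 2: transfer learning -- fine-tune the same network on real data,
# typically with a smaller learning rate.
train(model, placeholder_loader(), epochs=1, lr=1e-4)
```

The key design choice illustrated is reusing the weights learned on plentiful synthetic images as the starting point for the real-data fine-tuning stage, rather than initializing from scratch.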