Improving CNN-Based Viseme Recognition Using Synthetic Data

Andrea Britto Mattos; Dário Augusto Borges Oliveira; Edmilson da Silva Morais

doi:10.1109/ICME.2018.8486470

ICME 2018

Conference paper

08 Oct 2018

Abstract

Recently, deep learning-based methods have achieved high accuracy in visual speech recognition. However, while good results have been reported for words and sentences, recognizing shorter segments of speech, like phones, has proven to be much more challenging due to the lack of temporal and contextual information. In this work, we address the problem of recognizing visemes, which are the visual equivalent of phonemes, the smallest distinguishable sound units in a spoken word. Viseme recognition has applications in tasks such as lip synchronization, but acquiring and labeling a viseme dataset is complex and time-consuming. We tackle this problem by creating a large-scale, automatically labeled synthetic 2D dataset based on realistic 3D facial models. Then, we extract real viseme images from the GRID corpus, using audio data to locate phonemes via forced phonetic alignment and the registered video to extract the corresponding visemes, and evaluate the applicability of the synthetic dataset for recognizing real-world data.
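
The extraction step described in the abstract (phoneme intervals from forced alignment, mapped onto the registered video) can be illustrated with a minimal sketch. It is not the authors' implementation: the `PhoneInterval` type, the `viseme_frames` function, the tiny `PHONE_TO_VISEME` table, the choice of the interval midpoint as the representative frame, and the default of 25 frames per second are all assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class PhoneInterval(NamedTuple):
    """One aligned phone, as a forced aligner might report it (hypothetical type)."""
    phone: str
    start: float  # seconds
    end: float    # seconds


# Hypothetical, deliberately tiny phoneme-to-viseme table; a real mapping
# would cover the full phone set and is not specified in the abstract.
PHONE_TO_VISEME: Dict[str, str] = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
}


def viseme_frames(
    intervals: List[PhoneInterval],
    fps: float = 25.0,               # assumed video frame rate
    n_frames: Optional[int] = None,  # total frames in the clip, if known
) -> List[Tuple[int, str]]:
    """Return (frame_index, viseme_label) pairs for phones with a known viseme.

    Assumption: the frame at the temporal midpoint of each phone interval
    is taken as the representative viseme image.
    """
    selected = []
    for iv in intervals:
        viseme = PHONE_TO_VISEME.get(iv.phone)
        if viseme is None:
            continue  # phone not covered by the illustrative table
        midpoint = (iv.start + iv.end) / 2.0
        index = int(midpoint * fps)  # floor to the frame containing the midpoint
        if n_frames is not None:
            index = min(index, n_frames - 1)  # clamp to the last valid frame
        selected.append((index, viseme))
    return selected


alignment = [
    PhoneInterval("b", 0.5, 0.75),
    PhoneInterval("ih", 0.75, 1.0),  # skipped: not in the illustrative table
    PhoneInterval("m", 1.0, 1.25),
]
frames = viseme_frames(alignment, fps=25.0, n_frames=75)
```

Each returned pair names a frame to crop from the video and the label to attach to it; the cropping of the mouth region and the alignment itself are outside the scope of this sketch.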
