We propose a novel semi-supervised technique that enables expressive style control and cross-speaker transfer in neural text-to-speech (TTS) when the available training data contains only a limited amount of labeled expressive speech from a single speaker. The technique is based on unsupervised learning of a style-related latent space, generated by a previously proposed reference audio encoding technique, and transforming it by means of Principal Component Analysis (PCA) into another low-dimensional space. The latter space represents style information in a purified form, disentangled from text- and speaker-related information. Encodings for the expressive styles present in the training data are easily constructed in this space. Furthermore, the technique provides control over speech rate, pitch level, and articulation type, which can be used for TTS voice transformation. We present the results of subjective crowd evaluations confirming that the synthesized speech convincingly conveys the desired expressive styles while preserving a high level of quality.
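To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of the PCA step: reference-encoder embeddings are projected into a low-dimensional style space, and a style encoding is formed as the centroid of the labeled expressive utterances. All data and names here (`ref_embeddings`, `n_components`) are illustrative assumptions.

```python
# Hypothetical sketch: derive a low-dimensional style space from
# reference-encoder embeddings via PCA, then build a style encoding.
# All embeddings below are synthetic stand-ins, not real model outputs.
import numpy as np

rng = np.random.default_rng(0)

# Pretend reference-encoder outputs: 200 utterances x 128-dim embeddings.
ref_embeddings = rng.normal(size=(200, 128))
# Pretend the first 20 utterances carry one expressive-style label.
style_idx = np.arange(20)

# PCA via SVD of the mean-centered embeddings.
mean = ref_embeddings.mean(axis=0)
centered = ref_embeddings - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
n_components = 6                       # assumed style-space dimensionality
components = vt[:n_components]         # (6, 128) principal directions

# Project every utterance embedding into the style space.
style_space = centered @ components.T  # (200, 6)

# Style encoding: centroid of the labeled expressive utterances,
# mapped back to embedding space to condition the TTS decoder.
style_code = style_space[style_idx].mean(axis=0)
style_embedding = mean + style_code @ components

print(style_space.shape)      # (200, 6)
print(style_embedding.shape)  # (128,)
```

In this picture, individual principal components could plausibly be tied to attributes such as speech rate or pitch level, which is how moving along a component would yield the kind of controllable voice transformation the abstract describes.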