Learning speaker aware offsets for speaker adaptation of neural networks

Leda Sari; Samuel Thomas; Mark Hasegawa-Johnson

doi:10.21437/Interspeech.2019-1788

INTERSPEECH 2019

Conference paper

15 Sep 2019

Learning speaker aware offsets for speaker adaptation of neural networks

View publication

Abstract

In this work, we present an unsupervised long short-term memory (LSTM) layer normalization technique that we call adaptation by speaker aware offsets (ASAO). These offsets are learned using an auxiliary network attached to the main senone classifier. The auxiliary network takes main network LSTM activations as input and tries to reconstruct speaker, (speaker,phone) and (speaker,senone)-level averages of the activations by minimizing the mean-squared error. Once the auxiliary network is jointly trained with the main network, during test time we do not need additional information for the test data as the network will generate the offset itself. Unlike many speaker adaptation studies which only adapt fully connected layers, our method is applicable to LSTM layers in addition to fully-connected layers. In our experiments, we investigate the effect of ASAO of LSTM layers at different depths. We also show its performance when the inputs are already speaker adapted by feature space maximum likelihood linear regression (fMLLR). In addition, we compare ASAO with a speaker adversarial training framework. ASAO achieves higher senone classification accuracy and lower word error rate (WER) than both the unadapted models and the adversarial model on the HUB4 dataset, with an absolute WER reduction of up to 2%.

Conference paper