But labeled resources such as training utterances where some words are made more prominent than others are often not available. To address this, we investigated two major factors of variability: the type of access to labeled data required to train the system, and the model’s ability to effectively generate emphasis for a target voice for which no training materials are available.
In the latter case, we distinguished between a matched condition (when the run-time synthesis corresponds to a voice with labeled resources for training), and a transplant condition (when we synthesize from a voice lacking such resources, but which did nonetheless benefit from knowledge transfer via multi-speaker training).
For training, we used four corpora drawn from three US English voices.
The evaluation compared four separately trained systems:

- **Baseline (NoEmph)**: a system lacking word-level prosodic control; no emphasis-marking feature is used (B-D in Fig. 2).
- **Baseline (Sup)**: a system with classic supervision, in which the training material includes the corpus M1-emph and an explicit binary feature encoding the location of emphasis (A-D in Fig. 2).
- **PC-Unsup**: a fully unsupervised system that exploits only implicit knowledge about emphasis through variable prosodic controls (i.e., M1-emph is excluded from training, and the system has no access to any explicit information encoding emphasis location; B-H in Fig. 2).
- **PC-Hybrid**: a hybrid system combining implicit emphasis knowledge (via the variable prosodic controls) with explicit emphasis labels (by including M1-emph, and its labels, in the training; A-H in Fig. 2).
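To make the supervised conditioning concrete, here is a minimal sketch of how an explicit binary emphasis feature can be attached to a phoneme sequence before it enters a seq2seq TTS encoder, as in the Baseline (Sup) system. The function name, data layout, and example phoneme strings are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch: pair each phoneme with a 0/1 flag marking whether
# its word carries emphasis. A real system would embed the flag and
# concatenate it with learned phoneme embeddings.

def build_encoder_inputs(phonemes, emphasized_words, word_of_phoneme):
    """phonemes: phoneme symbols for the utterance
    emphasized_words: set of word indices carrying emphasis
    word_of_phoneme: word index for each phoneme position
    """
    features = []
    for phone, word_idx in zip(phonemes, word_of_phoneme):
        flag = 1 if word_idx in emphasized_words else 0
        features.append((phone, flag))
    return features

# Example: "move the BLUE block", with emphasis on word index 2 ("blue")
phones = ["m", "uw", "v", "dh", "ah", "b", "l", "uw", "b", "l", "aa", "k"]
word_of = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3]
inputs = build_encoder_inputs(phones, {2}, word_of)
# Phonemes of "blue" get flag 1; all others get flag 0.
```

The unsupervised PC-Unsup system, by contrast, has no such flag available at training time and must rely on the variable prosodic controls alone.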
We trained each of the four systems in a multi-speaker framework and evaluated them using about 1,000 votes from listeners recruited through a crowdsourcing platform. Each sentence contained one known emphasized word, and listeners were asked to rate the overall quality and naturalness of each sample and how well they thought the intended word conveyed emphasis.
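The aggregation of listener votes into the scores reported below can be sketched in a few lines. The vote data here is made up for illustration; in the actual evaluation each system received roughly 1,000 votes on a 5-point scale.

```python
# Minimal sketch of computing Mean Opinion Scores (MOS) from
# crowdsourced (system, rating) votes. Data is synthetic.

from collections import defaultdict

def mean_opinion_scores(votes):
    """votes: iterable of (system_name, rating) pairs on a 1-5 scale."""
    by_system = defaultdict(list)
    for system, rating in votes:
        by_system[system].append(rating)
    return {s: round(sum(r) / len(r), 2) for s, r in by_system.items()}

votes = [("PC-Hybrid", 4), ("PC-Hybrid", 5), ("PC-Unsup", 4),
         ("PC-Unsup", 3), ("PC-Hybrid", 4), ("PC-Unsup", 4)]
mos = mean_opinion_scores(votes)
# → {"PC-Hybrid": 4.33, "PC-Unsup": 3.67}
```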
The Mean Opinion Scores (MOS) on a 5-point scale are summarized in Table 1 and Table 2 for the transplant and matched conditions, respectively.
System | Emphasis Level | Overall Quality |
---|---|---|
Baseline (No Emph) | 2.20 | 3.87 |
Baseline (Sup) | 3.71 | 3.97 |
PC-Unsup | 3.58 | 3.97 |
PC-Hybrid | 4.02 | 4.08 |
Table 1: Mean Opinion Scores for the transplant condition (speaker F1). For emphasis, all pairwise differences are statistically significant. For quality, {Base (Sup), PC-Unsup} are statistically equivalent; all other pairwise differences are significant.
System | Emphasis Level | Overall Quality |
---|---|---|
Baseline (No Emph) | 2.21 | 3.87 |
Baseline (Sup) | 4.08 | 4.10 |
PC-Unsup | 3.35 | 3.82 |
PC-Hybrid | 3.96 | 4.08 |
Table 2: Mean Opinion Scores for the matched condition (speaker M1). For emphasis, all pairwise differences are statistically significant. For quality, there are no significant differences within the pairs {Base (NoEmph), PC-Unsup} and {Base (Sup), PC-Hybrid}; all other pairwise differences are significant.
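The pairwise significance statements in the table captions come from comparisons of per-system rating samples. The post does not specify the exact test used, so the sketch below assumes a two-sample Welch's t-statistic compared against the large-sample 5% critical value of about 1.96; the rating vectors are synthetic.

```python
# Hedged sketch of a pairwise significance check between two systems'
# ratings, using Welch's t-statistic (unequal variances). The actual
# evaluation may have used a different test; data here is synthetic.

import math

def welch_t(a, b):
    """Welch's t-statistic for two independent rating samples."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def differ_significantly(a, b, critical=1.96):
    # Large-sample approximation: reject equality at ~5% level.
    return abs(welch_t(a, b)) > critical

# Synthetic example: one clearly higher-rated sample vs. a lower one.
high = [4, 5, 4, 4, 5, 4, 5, 4, 4, 5]
low  = [2, 3, 2, 2, 3, 2, 3, 2, 2, 3]
```

With roughly 1,000 votes per system, even modest MOS gaps (such as 3.96 vs. 4.08 in Table 2) can reach significance under such a test.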
The results show that all controllable systems (the last three rows of each table) achieve a much stronger perceived degree of emphasis than the Baseline (NoEmph) system, with no loss in overall quality: on quality, each controllable system is statistically better than or equivalent to the baseline. As expected, the Baseline (NoEmph) system scored low on conveying emphasis.
There are differences between the approaches, however. When labeled data is available for the target speaker, our experiments suggest that the fully supervised approach offers the best operating point in terms of both quality and emphasis (speaker M1; Table 2). But this approach does not generalize as well as the hybrid approach to a new target speaker lacking labeled data (speaker F1; Table 1). Combining explicit supervision with the implicit knowledge captured by the prosodic-conditioning framework improves performance on both attributes when training a multi-speaker model to enable knowledge transfer.
We’ve also learned that, even in the absence of any labeled data, the approach provides good quality and emphasis control by boosting the predictions of a fully unsupervised model. This is made easier by our use of readily interpretable controls that can be linked to the task at hand.
The type of emphatic realization we have described here is entirely natural for humans. By highlighting selected components of a message, we guide listeners’ attention to specific aspects of the communication, potentially sparing the need for additional explanation. Equipping voice assistants with such expressive capabilities could help make them more human-like, while also providing a more efficient mechanism for interaction and a more pleasant user experience.