26 Apr 2021
Research
8 minute read

Austin or Boston? Making artificial speech more expressive, natural, and controllable

In our recent paper presented at the IEEE Spoken Language Technology Workshop, we describe a system that can emphasize or highlight certain words to improve the expressiveness of a sentence or help resolve contextual ambiguity.

Did you say you wanted to book a flight to Austin… or Boston?

Even a human would at times struggle to differentiate between the names of these two cities — they do sound quite similar. An AI in a dialog with a user could easily fumble too.

Speech synthesis technology in voice assistants could help, by emulating the type of expressiveness humans naturally deploy in face-to-face communication. In our recent paper "Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis," presented at the IEEE Spoken Language Technology Workshop in Shenzhen, China, we describe a system that does just that.1

Our system can emphasize or highlight certain words to improve the expressiveness of a sentence (such as "That is an excellent idea!") or to help resolve contextual ambiguity in a scenario like the Austin vs. Boston one.

Making speech synthesis more expressive

That's just one of the innovations in our sequence-to-sequence (S2S) synthesis work. As part of an ongoing collaboration between the IBM Research AI TTS (Text-to-Speech) team and IBM Watson, we aim to bring this expressiveness functionality into our TTS service. In recent years, TTS has achieved state-of-the-art performance with the introduction of deep, neural S2S architectures that provide high-quality outputs approaching the perceptual quality of natural speech.

The main idea is simple: move away from a classical approach that strings together several independently developed modules to a single model that trains all the components in an end-to-end fashion. This choice is effective but comes at a cost: since different components are no longer responsible for a specific function, it is difficult to intervene in the synthesis process to control a particular aspect of the output.

To solve this problem, we propose using a variant of the multi-speaker Tacotron-2 architecture described in “Natural TTS synthesis by conditioning Wavenet on MEL spectrogram predictions,”2 consisting of an encoder and a decoder mediated by an attention mechanism. This base model (shown by the components labeled B, C, and D, plus the Decoder, in Fig. 1) takes in an input representation of the text (box B; in our case, the phonemes making up a sentence), plus some knowledge about the speaker identity (box D) and encodes them via a combination of convolutional and bidirectional recurrent networks (component C).

The encoded sequence is then sent to the Spectral Decoder that consults with the attention module to figure out how to align the encoded input with the acoustic features of the output waveform.
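To make this concrete, here is a minimal sketch of what such an encoder might look like in PyTorch. The layer counts, dimensions, and argument names (num_phonemes, num_speakers, and so on) are illustrative assumptions rather than our exact configuration, and the attention module and spectral decoder are omitted.

    import torch
    import torch.nn as nn

    class Seq2SeqEncoder(nn.Module):
        """Illustrative Tacotron-2-style encoder: phoneme embeddings run
        through convolutional layers and a bidirectional LSTM, and a
        speaker embedding is appended to every encoded timestep."""

        def __init__(self, num_phonemes=100, num_speakers=4,
                     emb_dim=512, spk_dim=64):
            super().__init__()
            self.phoneme_emb = nn.Embedding(num_phonemes, emb_dim)
            self.speaker_emb = nn.Embedding(num_speakers, spk_dim)
            # A stack of 1-D convolutions over the phoneme sequence (component C).
            self.convs = nn.Sequential(*[
                nn.Sequential(
                    nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                    nn.BatchNorm1d(emb_dim),
                    nn.ReLU())
                for _ in range(3)])
            # Bidirectional recurrence; the two directions together keep emb_dim.
            self.blstm = nn.LSTM(emb_dim, emb_dim // 2,
                                 batch_first=True, bidirectional=True)

        def forward(self, phoneme_ids, speaker_id):
            # phoneme_ids: (batch, T) integer phoneme indices (box B)
            # speaker_id:  (batch,) integer speaker indices (box D)
            x = self.phoneme_emb(phoneme_ids)                   # (batch, T, emb_dim)
            x = self.convs(x.transpose(1, 2)).transpose(1, 2)   # convolve over time
            x, _ = self.blstm(x)                                # (batch, T, emb_dim)
            spk = self.speaker_emb(speaker_id).unsqueeze(1)     # (batch, 1, spk_dim)
            spk = spk.expand(-1, x.size(1), -1)
            # The attention-based spectral decoder consumes this sequence.
            return torch.cat([x, spk], dim=-1)

A decoder equipped with attention would then consume this encoded sequence step by step, emitting spectral features that a neural vocoder converts into a waveform.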

Our approach to injecting controllability into the system is simple. We know that to highlight some words, speakers tend to deviate from the rest of the sentence in terms of acoustic-prosodic properties such as speaking rate and fundamental frequency. Take, for example, a sentence that could come up in a dialog with an assistant:

I didn’t understand that quite right. Did you say your name was Greg, or Craig?

If this were a dialog between humans, the speaker could convey the uncertainty of the situation by, for instance, raising the volume and pitch on the highlighted words, articulating them more clearly and slowly, and possibly adding some brief, but perceptible, pauses before them.

To get our speech synthesis system to do the same, we exposed the model during training to a series of acoustic-prosodic parameters extracted from the output training waveforms (box F; see "Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis"1). This gave the system an opportunity to associate these prosodic inputs with emphasis on the output side. During inference, when these measures were not available, a separately trained predictor filled them in (component E). To match a desired level of emphasis, the value of these prosodic controls could be boosted by default or user-provided additive offsets (component G).

This way, component F in the architecture “taught” the model during training via acoustic proxies how to create emphasis. It could signal, say, that Craig and Greg need to receive a different type of articulation and prosody. These proxies were then filled in by the predictive module E to achieve the same effect.
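The sketch below illustrates that bookkeeping with made-up numbers: word-level prosodic controls (here, hypothetical pitch, duration, and pause features) are filled in by the predictor at inference time, and additive offsets boost them on the words to be emphasized before they reach the decoder. The feature set, units, and offset values are illustrative assumptions, not the ones used in our system.

    import numpy as np

    # Hypothetical word-level prosodic controls, one row per word:
    # columns = [log-F0 shift, log-duration (speaking rate), pre-word pause].
    # At training time these are measured from the recorded waveform
    # (component F); at inference time a trained predictor supplies them
    # (component E).
    predicted_controls = np.array([
        [0.02, 0.00, 0.00],   # "Did"
        [0.01, 0.01, 0.00],   # "you"
        [0.00, 0.00, 0.00],   # "say"
        [0.03, 0.02, 0.00],   # "Greg"
        [0.01, 0.00, 0.00],   # "or"
        [0.02, 0.01, 0.00],   # "Craig"
    ])

    # Additive offsets applied only to emphasized words (component G).
    # These illustrative values raise the pitch, slow the word down,
    # and insert a short pause before it.
    EMPHASIS_OFFSET = np.array([0.15, 0.20, 0.10])

    def apply_emphasis(controls, emphasized_word_indices, offset=EMPHASIS_OFFSET):
        """Boost the prosodic controls of the selected words; the result is
        broadcast to phoneme level and combined with the encoder output."""
        boosted = controls.copy()
        boosted[list(emphasized_word_indices)] += offset
        return boosted

    # Emphasize "Greg" (index 3) and "Craig" (index 5).
    controls_for_decoder = apply_emphasis(predicted_controls, [3, 5])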

Architecture of the sequence-to-sequence speech synthesis model, enhanced with prosodic controls for the realization of emphatic focus.

Training, training, training

But labeled resources, such as training utterances in which some words are made more prominent than others, are often unavailable. To address this, we investigated two major factors of variability: the type of access to labeled data required to train the system, and the model's ability to effectively generate emphasis for a target voice for which no training materials are available.

In the latter case, we distinguished between a matched condition (when the run-time synthesis corresponds to a voice with labeled resources for training), and a transplant condition (when we synthesize from a voice lacking such resources, but which did nonetheless benefit from knowledge transfer via multi-speaker training).

For training, we used four corpora from three US English voices. They were broken down as follows:

  • A set of 10,800 sentences from a male speaker (M1);
  • A set of 1,000 sentences from the same male speaker but where each sentence was expressly generated to contain several (tagged) emphasized words (M1-emph);
  • And two corpora from two distinct female speakers (F1 and F2) containing approximately 17,300 and 11,000 sentences, respectively. Of the three speakers, only M1 had explicitly labeled emphasis data (the M1-emph set), and that set was relatively small compared to the other corpora.

The evaluation compared four separately trained systems. The first one, which we called Baseline (NoEmph), is a system lacking word-level prosodic control; no emphasis-marking feature was used here (B-D in Fig. 1). The second one is Baseline (Sup), a system with classic supervision: here, the training material includes the corpus M1-emph and an explicit binary feature encoding the location of emphasis (A-D in Fig. 1).

The third system is PC-Unsup, a fully unsupervised synthesis system that exploits only implicit knowledge about emphasis through the variable prosodic controls (that is, M1-emph is not included in the training, and the system has no access to any explicit information encoding emphasis location; B-H in Fig. 1). And finally, PC-Hybrid is a hybrid system combining implicit emphasis knowledge (via the variable prosodic controls) with explicit emphasis labels (by including M1-emph, and its labels, in the training; A-H in Fig. 1).
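To keep the four configurations straight, the snippet below lays them out as plain dictionaries. The flag names are ours and purely illustrative; only the on/off pattern, which mirrors the description above, matters.

    # Illustrative summary of the four evaluated systems. Field names are
    # hypothetical; they simply record which ingredients each system uses.
    SYSTEMS = {
        "Baseline (NoEmph)": dict(
            uses_M1_emph_corpus=False,     # no explicitly labeled emphasis data
            explicit_emphasis_flag=False,  # no binary emphasis feature (A)
            prosodic_controls=False,       # no word-level controls (E-H)
        ),
        "Baseline (Sup)": dict(
            uses_M1_emph_corpus=True,
            explicit_emphasis_flag=True,
            prosodic_controls=False,
        ),
        "PC-Unsup": dict(
            uses_M1_emph_corpus=False,
            explicit_emphasis_flag=False,
            prosodic_controls=True,
        ),
        "PC-Hybrid": dict(
            uses_M1_emph_corpus=True,
            explicit_emphasis_flag=True,
            prosodic_controls=True,
        ),
    }

    for name, config in SYSTEMS.items():
        print(name, config)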

We trained each of the four systems in a multi-speaker framework and evaluated them using about 1,000 votes from listeners recruited through a crowdsourcing platform. Each sentence contained one known emphasized word, and listeners were asked to rate the overall quality and naturalness of each sample and how well they thought the intended word conveyed emphasis.
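As a rough, hypothetical illustration of how such votes turn into the scores and significance statements reported below, one could aggregate the 5-point ratings per system and run a pairwise test; the data here are random placeholders, and the paper's actual statistical procedure may differ.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Placeholder 5-point emphasis ratings from crowd listeners; in the
    # real evaluation roughly 1,000 votes were collected.
    ratings = {
        "Baseline (NoEmph)": rng.integers(1, 4, size=250),
        "PC-Hybrid": rng.integers(3, 6, size=250),
    }

    # Mean Opinion Score = the average of the 5-point votes.
    for system, votes in ratings.items():
        print(f"{system}: MOS = {votes.mean():.2f}")

    # A non-parametric pairwise test (e.g., Mann-Whitney U) is one common
    # way to check whether two systems' ratings differ significantly.
    u, p = stats.mannwhitneyu(ratings["Baseline (NoEmph)"],
                              ratings["PC-Hybrid"],
                              alternative="two-sided")
    print(f"Mann-Whitney U = {u:.1f}, p = {p:.3g}")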

The Mean Opinion Scores (MOS) on a 5-point scale are summarized in Table 1 and Table 2 for the transplant and matched conditions, respectively.

System               Emphasis Level   Overall Quality
Baseline (NoEmph)    2.20             3.87
Baseline (Sup)       3.71             3.97
PC-Unsup             3.58             3.97
PC-Hybrid            4.02             4.08

Table 1: Mean Opinion Scores for the transplant condition (speaker F1). For emphasis, all pairwise differences are statistically significant. For quality, Baseline (Sup) and PC-Unsup are statistically equivalent; all other pairwise differences are statistically significant.

System               Emphasis Level   Overall Quality
Baseline (NoEmph)    2.21             3.87
Baseline (Sup)       4.08             4.10
PC-Unsup             3.35             3.82
PC-Hybrid            3.96             4.08

Table 2: Mean Opinion Scores for the matched condition (speaker M1). For emphasis, all systems are statistically significantly different from each other. For quality, there are no statistically significant differences within the pairs {Baseline (NoEmph), PC-Unsup} and {Baseline (Sup), PC-Hybrid}; all other pairwise differences are significant.

Quality and emphasis control

The results show that all controllable systems (the last three rows of the tables) exhibit a much higher perceptual degree of emphasis than the Baseline (NoEmph) system, with no loss in overall quality: on that metric, the controllable systems are statistically better than, or equivalent to, the baseline. The Baseline (NoEmph) system, as expected, attained low scores in terms of creating emphasis.

There are differences between the approaches, however. When labeled data is available for a target speaker, our experiments suggest that the fully supervised approach offers the best operating point in terms of both quality and emphasis (speaker M1; Table 2). But this approach does not generalize as well as the hybrid approach to a new target speaker lacking labeled data (speaker F1; Table 1). Combining explicit supervision with the implicit knowledge carried by the prosodic-conditioning framework improves performance on both attributes when training a multi-speaker model, enabling the transfer of knowledge to new voices.

We’ve also learned that even lacking any labeled data, the approach provides good quality and emphasis control by boosting the predictions of a fully unsupervised model. This is made easier by our use of readily interpretable controls that can be linked to the task at hand.

The type of emphatic realization we have described here is entirely natural for humans. By focusing on selective components of the message, we guide listeners' attention to specific aspects of the communication, potentially avoiding the need for extra explanation. Equipping voice assistants with such expressive capabilities could help make them more human-like, and could also provide a more efficient mechanism for interaction and a more pleasant user experience.

References

  1. Shechtman, S., Fernandez, R. & Haws, D. Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis. In 2021 IEEE Spoken Language Technology Workshop (SLT) (IEEE, 2021).

  2. Shen, J. et al. Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2018).