With the help of a voice actor, a dynamic script, and AI voice-conversion technology, researchers transform the flat voices of older virtual agents into something more expressive.
When IBM created its first customer service bots, the actors playing the voices of Allison and Lisa read their lines with the even, detached tone of a news anchor. Text-to-speech technology was in its infancy, and speech-generation models at the time struggled to capture the spontaneity and emotion of spoken language. To compensate, actors were coached to deliver their lines as uniformly as possible.
Two decades later, many synthetic voices can pass for human, thanks to recent breakthroughs in speech processing. The variations in volume, pacing, and intonation that make spoken language so difficult for machines to reproduce have been overcome with the help of speech-generation AI. In a new study at the Interspeech 2022 conference, IBM researchers merge this new technology with old-school storytelling and performance techniques to inject empathy and expressiveness into the voices of IBM’s older customer service bots.
“If you’re on a call with customer service, you really want a comforting voice to ease your frustration,” said study co-author Raul Fernandez, an IBM researcher who focuses on AI speech processing. “That’s the tone we’re aiming for with this new technique to make the legacy agents used by Watson Assistant sound more conversational and empathetic.”
Two decades ago, IBM trained its first text-to-speech systems, and Fernandez was part of the team that drafted what IBM’s virtual agents would say and coached actors on how to say it. Converted to digital form, those voices became Lisa, Allison, and other fictional personas that Watson Assistant later drew on.
“Consistency was the key,” said Fernandez. “We had to try and straitjacket the performances because of the limits of the technology at the time.”
In recent years, deep neural networks have transformed speech processing, and IBM’s customer service voice bots have evolved accordingly. Converted to deep nets, the voices of Allison, Lisa, and the others today sound more natural. In the new study, IBM researchers show that these voices can be further improved with a professional voice actor, a good script, and next-generation AI speech and voice conversion models.
Recording a new voice with ‘rehearsed spontaneity’
As they set out to update IBM’s legacy voices, the researchers knew that AI enhancements alone wouldn’t be enough. They needed more dynamic conversational training data. So they hired a voice actor, and Fernandez returned to a recording studio in Manhattan — this time with a script color-coded with spontaneous, emotive features, and real customer-service chat logs.
In the studio, they sought to recreate some common scenarios. The voice actor was asked to silently mouth the customer’s lines, for context, and sometimes even say the customer’s lines aloud to mimic the mirroring that happens in real conversation. Each of the actor’s lines was coded with one of 10 expressive labels, from “empathy” to “surprise” to “positive feedback.” Fernandez worked with the actor to vocally shape each category to make it distinct. To add to the script’s improvisational feel, they sprinkled throwaway words like “um,” “huh?” and “aha!” throughout.
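The annotation scheme described above can be pictured as a simple data structure: each scripted agent line carries exactly one of the 10 expressive labels. The article names only a few of those labels (“empathy,” “surprise,” “positive feedback,” and, later, “uncertainty”), so the remaining label names in this sketch are hypothetical placeholders, and the `ScriptLine` class is our own illustration rather than the researchers’ actual tooling.

```python
from dataclasses import dataclass

# Labels quoted in the article: "empathy", "surprise", "positive_feedback",
# "uncertainty". The other six names are hypothetical stand-ins for the
# unspecified members of the 10-label set.
EXPRESSIVE_LABELS = {
    "empathy", "surprise", "positive_feedback", "uncertainty",
    "cheerfulness", "apology", "confirmation", "hesitation",
    "encouragement", "neutral",
}

@dataclass
class ScriptLine:
    """One agent line from the color-coded recording script."""
    text: str
    label: str

    def __post_init__(self):
        # Every line must carry exactly one known expressive label.
        if self.label not in EXPRESSIVE_LABELS:
            raise ValueError(f"unknown expressive label: {self.label!r}")

# The two agent lines quoted later in the article, with their tags.
script = [
    ScriptLine("Oh, this is confusing! It seems like your password "
               "might have expired?", "uncertainty"),
    ScriptLine("Oh, don't apologize! I'm happy to help!", "empathy"),
]
```

Keeping the label as a single categorical field per line mirrors the article’s description: one expressive category per utterance, shaped vocally by the actor so each category stays distinct.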
“I call it rehearsed spontaneity,” said Fernandez.
To get seven hours of usable recordings, the researchers clocked more than twice as much time in the studio. The result? A voice with the caring qualities of a friend. “Oh, this is confusing!” the agent says, in a line tagged “uncertainty.” “It seems like your password might have expired?” The actor’s voice ends on a high note — a cue to the customer to confirm.
At another point, the agent cheerfully responds: “Oh, don’t apologize! I’m happy to help!” in a line labeled “empathetic.”
An updated architecture to make Allison and Lisa sound more human
Blending this new voice with the old required a new AI model architecture. Older text-to-speech technology based on deep neural networks wasn’t expressive enough, and more recent sequence-to-sequence models that predict speech directly from the text couldn’t reproduce the variety and range of everyday speech.
“Text can be spoken in different ways,” said Slava Shechtman, an IBM researcher focused on generative AI for speech synthesis. “It’s not enough to produce one realistic reading of the text. The product user has to be able to vary the rhythm and tone to have multiple readings of the text.”
In an earlier paper, the team designed a sequence-to-sequence model that lets the user control the expressiveness of the voice, putting emphasis on some words to disambiguate similar-sounding words like Austin and Boston. In the current paper, they add a voice conversion model to graft the expressive speaking style of their recently recorded voice actor onto the voice identity of IBM’s older virtual agents, Allison and Lisa.
To train the voice conversion model, they fine-tune HuBERT, a self-supervised speech foundation model, on the combined recordings of the new expressive voice and the legacy Allison and Lisa voices. In a separate data augmentation step, they convert the expressive recordings into both Allison and Lisa’s distinctive voices. They then train their controllable sequence-to-sequence model on the expressive speech converted to Allison and Lisa’s identity as well as the original Allison and Lisa recordings.
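The data-augmentation step described above can be sketched as follows. This is a minimal illustration of how the training set is assembled, not the researchers’ actual code: `convert_voice` is a hypothetical stand-in for the HuBERT-based voice conversion model, and the clip dictionaries are placeholder structures.

```python
def convert_voice(expressive_clips, target_identity):
    """Stand-in for the voice conversion model: re-render each expressive
    clip in the target speaker's voice identity, keeping its style label."""
    return [
        {"audio": clip["audio"], "speaker": target_identity,
         "style": clip["style"]}
        for clip in expressive_clips
    ]

def build_training_set(expressive_clips, legacy_clips_by_speaker):
    """Assemble the augmented training set: for each legacy agent, combine
    (a) the expressive recordings converted into that agent's voice with
    (b) the agent's original legacy recordings."""
    training_set = []
    for speaker, legacy_clips in legacy_clips_by_speaker.items():
        training_set += convert_voice(expressive_clips, speaker)
        training_set += legacy_clips
    return training_set
```

The key design point, as the article describes it, is that the sequence-to-sequence model never trains on the actor’s own voice directly: it only ever sees the expressive style already converted into Allison’s or Lisa’s identity, alongside the original legacy recordings.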
“The model now has the ability to exclaim ‘aha!’ or give the customer an encouraging ‘hmm,’ ” said Shechtman.
When the researchers asked listeners on Amazon’s Mechanical Turk platform whether they preferred the old voices of Allison and Lisa or the updated, blended voices, more than 60 percent chose the computer-generated blended version. The researchers plan to follow up with additional studies to see how users interact with the new versions of Allison and Lisa, and how the voices can be further improved.
Listen to Allison and Lisa before and after voice conversion.