The IBM Virtual Voice Creator takes Text-To-Speech (TTS) synthesis technology to the next level, letting enterprise customers and users create unique voice personas on demand, quickly and easily. It lets you automatically create a voiceover for a multi-character game, animation or educational video, without the hassle of hiring voice actors and booking audio recording studios.

High-quality expressive TTS that speaks in a multitude of voices created by you

Modern TTS technology approaches human capabilities in terms of speech naturalness and expressiveness. Because it is automated and flexible, TTS has the potential to serve a growing range of applications in domains such as entertainment and education. A fundamental barrier that limits the spread of TTS is that speech synthesis systems can speak only in a limited number of voices prepared in advance, typically through an expensive, labor-intensive and lengthy process.

Traditionally, each TTS voice is created from a corpus of audio recordings of a single speaker. A typical high-quality TTS voice requires 10 to 20 hours of audio data recorded from a voice actor in a professional studio. Actor auditions and recordings can take weeks. The recordings are then converted to a TTS voice dataset using a complex semi-automatic process, which typically involves manual inspection and cleaning steps performed by skilled personnel. As a result, creating a voice is time-consuming and costly.

The IBM Virtual Voice Creator technology removes this barrier by allowing users to change a TTS voice according to their needs and imagination. An entire universe populated with different human voices, along with exaggerated cartoonish ones, can now be derived from a couple of standard TTS voices.

How the IBM Virtual Voice Creator works

The Virtual Voice Creator is built on top of the IBM Watson TTS technology. This TTS technology employs the unit selection synthesis approach, which, as of today, provides the most natural sound and intonation achievable with modern commercial TTS systems. Watson TTS comes with a set of rich and meticulously cleaned standard voice datasets.

The IBM Virtual Voice Creator adds unique voice transformation capabilities to Watson TTS. We use a sophisticated offline analysis process to prepare the standard voice datasets for transformations that alter voice qualities and perceived speaker identity at synthesis time. The transformations modify various aspects of the voice components associated with the key organs of the human speech production mechanism: the vocal folds and the vocal tract.

The following speech samples demonstrate the effects of individual voice modifications.

[Audio samples, not reproduced here: source; low, high; monotonous, dynamic; short, altered, long; slow, fast; GT low, GT high; breath low, breath high]

The solution’s web GUI studio lets users configure the voice transformations quickly and easily. The user selects a standard voice as a basis and changes it by controlling the vocal tract size and shape, tone, glottal tension, breathiness and speed, with immediate audio feedback after each change. The user can then store the transformation configuration, together with the reference to the standard voice, as a new virtual voice and use it in the future to synthesize any text.
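To make the idea of a stored virtual voice concrete, here is a minimal Python sketch of a record that pairs a standard voice reference with a transformation configuration. It is an illustration only, not the service's actual data model: the class names, field names, the voice identifier "standard_voice_1" and the normalized range from -1.0 to 1.0 are all assumptions made for this example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class VoiceTransformation:
    """Hypothetical controls; 0.0 means 'leave the standard voice unchanged'."""
    vocal_tract_size: float = 0.0
    vocal_tract_shape: float = 0.0
    tone: float = 0.0
    glottal_tension: float = 0.0
    breathiness: float = 0.0
    speed: float = 0.0

    def __post_init__(self):
        # The [-1.0, 1.0] range is an assumption chosen for this sketch.
        for name, value in asdict(self).items():
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within [-1.0, 1.0], got {value}")


@dataclass(frozen=True)
class VirtualVoice:
    """A virtual voice: a standard voice reference plus a transformation."""
    name: str
    base_voice: str  # identifier of the standard voice (hypothetical)
    transformation: VoiceTransformation = field(default_factory=VoiceTransformation)

    def to_json(self) -> str:
        # asdict() also converts the nested transformation into a plain dict.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "VirtualVoice":
        data = json.loads(text)
        return cls(
            name=data["name"],
            base_voice=data["base_voice"],
            transformation=VoiceTransformation(**data["transformation"]),
        )


# Design a voice once, store it, and load it again later.
ogre = VirtualVoice(
    name="ogre",
    base_voice="standard_voice_1",
    transformation=VoiceTransformation(vocal_tract_size=0.8, tone=-0.6, speed=-0.3),
)
stored = ogre.to_json()
restored = VirtualVoice.from_json(stored)
```

Because the record holds only a reference to the standard voice and a handful of numbers, a new voice in this sketch costs a few bytes of configuration rather than a new recorded corpus.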

The entire solution, including the virtual voice design and synthesis components, is delivered as a cloud service.

The IBM Virtual Voice Creator R&D team is working on enriching the voice transformations repertoire and enhancing speech expressiveness.

An example use case – video game voiceover automation

Voices in games, especially in the role-playing and adventure genres, are vital to the gamer experience. That’s why game developers have started using professional voice actors on a regular basis.

However, discussions on game developer forums keep returning to questions such as: Why are so many games not fully voiced? Why are game dialogs often presented as text bubbles?

The reasons are rooted in the costly and cumbersome legacy voiceover process, the only phase in which the game developer depends on human actors and media capture. The answers typically given in these forums include:
  • It is expensive, including the cost of the actors and a professional sound studio.
  • You have to hold a casting with many actors, which takes a lot of time.
  • To change a line in a script, you need to re-hire the actor (and hope they have time in their schedule) and get them back into the studio.
  • When the voice actor for an important character has a different obligation, your whole downloadable content project might have to be cancelled.
  • When the text is generated dynamically, a recorded voiceover is not a viable option.

Using the IBM Virtual Voice Creator, developers can design voice personas through an automated, interactive creative process that resembles the one they currently use to design the visual appearance of their characters. The text scripts are automatically converted to speech, and they can be changed and re-synthesized as often as needed. Dynamically generated text can be synthesized to speech as well.
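The workflow described above can be sketched in a few lines of Python. This is a hedged illustration, not the service's API: the function voice_script, the casting table and the stand-in fake_synthesize are hypothetical names, and the stand-in returns tagged bytes instead of audio so that the example runs without any cloud service.

```python
from typing import Callable, Dict, List, Tuple

ScriptLine = Tuple[str, str]  # (character name, line of dialog)


def voice_script(
    script: List[ScriptLine],
    casting: Dict[str, str],
    synthesize: Callable[[str, str], bytes],
) -> List[bytes]:
    """Convert every script line to audio using the voice cast for its character."""
    missing = sorted({character for character, _ in script if character not in casting})
    if missing:
        raise KeyError(f"no virtual voice assigned to: {', '.join(missing)}")
    return [synthesize(casting[character], text) for character, text in script]


def fake_synthesize(voice_name: str, text: str) -> bytes:
    """Stand-in for a real TTS call; returns tagged bytes rather than audio."""
    return f"[{voice_name}] {text}".encode("utf-8")


# Casting maps each character to a previously designed virtual voice.
casting = {"hero": "hero_voice", "ogre": "ogre_voice"}
script = [("hero", "Who goes there?"), ("ogre", "Nobody.")]
clips = voice_script(script, casting, fake_synthesize)

# Changing a line only requires running the synthesis again.
script[1] = ("ogre", "Nobody you need to worry about.")
clips = voice_script(script, casting, fake_synthesize)
```

In this sketch a script change or a dynamically generated line goes through the same function call, which is the property that the recorded voiceover process lacks.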