Speech Technologies

We specialize in speech and multimodal AI for interaction, analytics, media, and security applications. Through advanced research and development, we create AI algorithms, technology components, solutions, and services that enhance the experience and capabilities offered to enterprises, mobile users, and application and content developers.

Our current focus is on Spoken Customer Care, where our goals are to improve customer experience in human-machine interaction (voice bots), and to enhance analytics capabilities in human-human call center interaction.

Our expertise covers a wide spectrum of technologies: expressive speech synthesis, voice customization, spoken language understanding and generation, speaker recognition, language identification, speech-based emotion recognition, and audiovisual analytics.

With our vast expertise in speech signal processing, machine learning, and deep learning, we support IBM product teams and participate in research activities that advance speech science and technology.


Ron Hoory, Manager Speech Technologies, IBM Research - Haifa




State-of-the-art Text-to-Speech (TTS) technology with high naturalness and expressiveness for delivering information and interacting with enterprise customers.

With many years of experience in TTS, we continuously innovate and contribute to future versions of the TTS service on the IBM Watson Developer Cloud. Our current research focus is on:
  • Harnessing deep learning to close the gap between synthesized speech and natural human speech
  • Adaptation of the TTS voice model to imitate a target speaker
  • Controlling the expression, emphasis, and emotion of synthesized speech while preserving high quality

Speaker Diarization

Speaker diarization ("who spoke when") is a critical step in the analysis of multi-speaker audio streams, such as single-channel call center recordings, TV broadcasts, or recorded meetings. Our research technology pushes accuracy beyond the state of the art using advanced deep learning and clustering techniques.
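
To make the "who spoke when" idea concrete, here is a minimal Python sketch of the clustering step. It is an illustration under stated assumptions, not the group's actual system: the segment embeddings are assumed to come from some speaker-embedding model (not shown), and the function name `diarize`, the greedy assignment strategy, and the `threshold` value are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def diarize(segments, threshold=0.8):
    """Greedy clustering of speech segments into speakers.

    segments: list of (start_sec, end_sec, embedding) tuples, in time order.
    Returns a list of (start_sec, end_sec, speaker_label) tuples.
    The threshold is an assumed value chosen for illustration only.
    """
    centroids = []  # one running-sum embedding per discovered speaker
    labeled = []
    for start, end, emb in segments:
        scores = [cosine(emb, c) for c in centroids]
        if scores and max(scores) >= threshold:
            # Close enough to a known speaker: assign and update that centroid.
            idx = scores.index(max(scores))
            centroids[idx] = [c + e for c, e in zip(centroids[idx], emb)]
        else:
            # No similar speaker yet: open a new cluster.
            centroids.append(list(emb))
            idx = len(centroids) - 1
        labeled.append((start, end, f"speaker_{idx}"))
    return labeled

# Toy 2-D embeddings standing in for real speaker embeddings.
toy_segments = [
    (0.0, 2.0, [1.0, 0.0]),
    (2.0, 4.0, [0.0, 1.0]),
    (4.0, 6.0, [0.9, 0.1]),
]
timeline = diarize(toy_segments)
```

In this toy input the first and third segments point in nearly the same direction, so they receive the same speaker label, while the second segment opens a new cluster.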


Spoken Dialogue

Advanced AI for enhancing customer care voice interaction ("Voice Bot").

Our research is focused on the following aspects:
  • Improving the voice bot's spoken language understanding, using advanced methods for detecting intents and entities in the user's speech
  • Improving the naturalness and expressiveness of the voice bot's spoken responses
  • Evaluating the user experience of voice bots, including the impact of technology enhancements

Past Activities

Audiovisual Analytics

Advanced AI for audio and audiovisual analytics to generate automatic closed captioning and video enrichment.

Our research on speaker analysis includes speaker change detection, speaker diarization, and speaker identification, based on either the audio track alone or a combination of the audio and visual tracks. We are also developing audio event detection to identify speech, music, sirens, gunshots, applause, cheering, and other types of audio.


IBM Virtual Voice Creator

Text-to-Speech (TTS) synthesis technologies have become increasingly natural-sounding and expressive, opening up new opportunities in domains such as entertainment and education. The IBM Virtual Voice Creator vision is customizable voice generation for game characters, cartoon heroes, and engaging conversational agents. It becomes a reality through high-quality, expressive TTS with customizable online voice transformations and an interactive web studio for voice design. We’re developing an inexpensive, fast, repeatable, and flexible voice-over process that encompasses static as well as dynamic and AI-generated textual content.


Mobile Multi-Factor Authentication (MMFA)

The mobile environment presents many security and usability challenges. MMFA combines the multitude of sensors and information channels on mobile devices with our multimodal biometric authentication technology. We’re working to maximize both security and usability according to the situation, risk, and environment. Video authentication combines speaker and face verification for high accuracy (0.1% equal error rate), audiovisual liveness detection against replay attacks, and high usability in a short authentication session of several seconds of selfie video.
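
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false accept rate (impostors wrongly accepted) equals the false reject rate (genuine users wrongly rejected). The Python sketch below shows one simple way to estimate it from verification scores. The function name and the toy scores are hypothetical and are not data from the MMFA system.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by scanning every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where the false accept rate
    and false reject rate are closest to each other.
    """
    best = None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        # Fraction of impostor trials that would be accepted at threshold t.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # Fraction of genuine trials that would be rejected at threshold t.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Made-up scores for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.6]
eer, threshold = equal_error_rate(genuine, impostor)
# Here eer is 0.25 (25%) at threshold 0.6.
```

An EER of 0.1% means that, at the balanced operating point, roughly one in a thousand impostor attempts is accepted and one in a thousand genuine attempts is rejected.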


Speech-based Emotion Recognition

Technologies and solutions for speech-based emotion recognition that enable analytics of spoken data as well as affect-aware human-computer interaction. Speech-based emotion recognition combines predictions from both the verbal content (the textual transcript) and the non-verbal content (by direct signal analysis).
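
One common way to combine verbal and non-verbal predictions is late fusion, in which each model outputs per-emotion probabilities and the two are blended with a weight. The Python sketch below assumes this approach purely for illustration. The text does not state which fusion method is used, and the function name, the `text_weight` parameter, and the example probabilities are hypothetical.

```python
def fuse_predictions(text_probs, audio_probs, text_weight=0.5):
    """Blend two per-emotion probability dicts with a weighted average.

    text_probs:  emotion -> probability from a transcript-based model.
    audio_probs: emotion -> probability from an acoustic-signal model.
    text_weight: share given to the text model (assumed, tunable).
    Returns a normalized dict over the emotions both models share.
    """
    labels = sorted(text_probs.keys() & audio_probs.keys())
    fused = {
        label: text_weight * text_probs[label]
        + (1.0 - text_weight) * audio_probs[label]
        for label in labels
    }
    total = sum(fused.values())
    # Renormalize in case the inputs did not sum to exactly 1.
    return {label: p / total for label, p in fused.items()}

# The words sound neutral, but the tone of voice suggests anger.
from_text = {"angry": 0.2, "neutral": 0.8}
from_audio = {"angry": 0.7, "neutral": 0.3}

balanced = fuse_predictions(from_text, from_audio)
audio_heavy = fuse_predictions(from_text, from_audio, text_weight=0.2)
top_balanced = max(balanced, key=balanced.get)
top_audio_heavy = max(audio_heavy, key=audio_heavy.get)
```

With equal weights the fused prediction stays "neutral". Giving more weight to the acoustic model flips it to "angry", which shows why the balance between the two channels matters.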