Speech Technologies Seminar 2008
July 02, 2008
Organized by IBM Haifa Research Lab
Program Abstracts
Challenges of Speech Solutions in Call Centers
Nava Shaked, Manager, CRM & Call Center, IBM Israel
Implementing speech-based projects for call centers is a challenging task. We are often asked by our customers to enhance the efficiency of their current systems, while maximizing results with new implementations. We find that several challenges need to be taken into account, such as:
- Managing the existing infrastructure and legacy systems, while introducing new technologies and processes
- Balancing the business positioning of marketing against IT requirements, which sometimes conflict
- Addressing VUI issues, which are no less important than the technology itself and can make the difference between a "good" and a "bad" system
- Combining more than one speech technology to handle business situations as well as regulatory requirements, for example speech recognition and speech biometrics. This sounds great, but what is the right way to implement both effectively?
This presentation will discuss these key challenges, illustrate them with real customer problems, and address the issues with possible solutions. A case will be made for taking a more holistic approach when driving speech projects.
Actionable Intelligence via Speech Analytics
Ofer Shochet, Senior VP, VERINT
Over the past six years, speech analytics in the contact center has transitioned from an anecdotal, experimental initiative to a mainstream capability, deployed at hundreds of sites and analyzing the conversations of tens of thousands of customer service representatives. In this presentation, we will demonstrate how speech analytics transforms recorded customer interactions from idle data into actionable intelligence. Based on real-life experience, we will provide examples of how speech analytics is helping contact centers achieve their business goals.
Discriminative Keyword Spotting
Joseph Keshet, IDIAP
We present a new approach for keyword spotting that is not based on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure used to evaluate keyword spotters. The keyword spotter we devised is based on mapping the input acoustic representation of the speech utterance, along with the target keyword, into a vector space. Building on techniques used in large-margin and kernel methods (such as SVMs and boosting) for predicting whole sequences, our keyword spotter distills to a classifier in this vector space that separates speech utterances in which the keyword is uttered from those in which it is not. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties. Experiments on read speech from the TIMIT corpus show that our method outperforms the conventional context-independent HMM-based approach. Further experiments with the TIMIT-trained model, tested on both read (HTIMIT, WSJ) and spontaneous speech (OGI-Stories), show that without further training or adaptation to the new corpora, our method still outperforms the conventional context-independent HMM-based approach.
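A minimal sketch of the large-margin idea behind such a spotter (an illustrative surrogate, not the authors' algorithm): each (utterance, keyword) pair is assumed to be already mapped to a fixed-length feature vector, and a weight vector is trained iteratively so that utterances containing the keyword score above those that do not, which serves as a proxy for maximizing the area under the ROC curve.

```python
import numpy as np

def train_keyword_spotter(pos_feats, neg_feats, epochs=50, lr=0.1, margin=1.0):
    """Iterative large-margin training: push feature vectors of utterances that
    contain the keyword (pos_feats) to score above those that do not (neg_feats).
    Ranking all positive/negative pairs correctly corresponds to an AUC of 1."""
    w = np.zeros(pos_feats.shape[1])
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(len(pos_feats)):
            j = rng.integers(len(neg_feats))
            # hinge-loss update on a sampled (positive, negative) pair
            if w @ pos_feats[i] - w @ neg_feats[j] < margin:
                w += lr * (pos_feats[i] - neg_feats[j])
    return w

def empirical_auc(w, pos_feats, neg_feats):
    """Fraction of (positive, negative) pairs ranked correctly by w."""
    return np.mean((pos_feats @ w)[:, None] > (neg_feats @ w)[None, :])

# toy demonstration with synthetic fixed-length 'features'
rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, size=(200, 16))
neg = rng.normal(0.0, 1.0, size=(400, 16))
w = train_keyword_spotter(pos, neg)
print("empirical AUC:", empirical_auc(w, pos, neg))
```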
Recent Advances in Speech Dereverberation
Emanuel Habets, Bar-Ilan University & Technion
In speech communication systems, the received speech signal is degraded by the acoustic channel, ambient noise, and other interferences. This degradation can decrease the fidelity and intelligibility of speech and the word recognition rate of automatic speech recognition systems. While state-of-the-art acoustic signal processing algorithms are available to reduce noise, the development of practical algorithms that can reduce the degradation caused by reverberation has long been one of the field's holy grails. The main difference between noise and reverberation is that in the case of reverberation the degrading component depends on the desired signal, whereas noise can usually be assumed to be independent of it.
Dereverberation methods can be divided into two categories: reverberation cancellation and reverberation suppression. Cancellation methods equalize the acoustic channel, while suppression methods reduce the degradation of the desired signal caused by the channel. Reverberation cancellation in practical scenarios remains an unsolved and challenging problem, mainly because the acoustic channel is too complex to model deterministically. Recently emerged suppression methods utilize a statistical model of the acoustic channel that depends on a few of its characteristics, such as the reverberation time and the direct-to-reverberation ratio.
In this talk, we formulate the problem of speech dereverberation and discuss how reverberation affects speech intelligibility and automatic speech recognition. We then give a short overview of different dereverberation methods and focus on a recently developed suppression method that jointly suppresses reverberation and ambient noise. Experimental results using real and simulated reverberant signals show significant reverberation and noise reduction with perceptually low speech distortion.
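As a rough illustration of the suppression idea (not the specific method presented in the talk), a statistical exponential-decay room model lets one predict the late-reverberant power from the reverberation time and attenuate each time-frequency bin with a Wiener-like gain; the frame shift, delay, and gain floor below are illustrative assumptions.

```python
import numpy as np

def suppress_late_reverb(spec_power, t60, frame_shift_s, delay_frames=6, gain_floor=0.1):
    """Single-channel late-reverberation suppression sketch.

    spec_power:    (n_frames, n_bins) short-time power spectrum of reverberant speech
    t60:           estimated reverberation time in seconds
    frame_shift_s: frame shift of the analysis in seconds
    The late reverberant power in frame n is modeled as an exponentially decayed
    copy of the total power 'delay_frames' frames earlier (Polack-style model)."""
    decay = np.exp(-2.0 * (3.0 * np.log(10) / t60) * delay_frames * frame_shift_s)
    late_psd = np.zeros_like(spec_power)
    late_psd[delay_frames:] = decay * spec_power[:-delay_frames]
    # Wiener-like gain, floored to limit speech distortion
    gain = np.maximum(1.0 - late_psd / np.maximum(spec_power, 1e-12), gain_floor)
    return gain * spec_power

# toy usage with random spectra standing in for a real STFT
rng = np.random.default_rng(0)
power = rng.exponential(1.0, size=(100, 257))
enhanced = suppress_late_reverb(power, t60=0.6, frame_shift_s=0.016)
print(enhanced.shape)
```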
On Improving the Quality of Small Footprint Concatenated Text-to-Speech Synthesis Systems
David Malah, Head of Signal and Image Processing Lab, Technion
High-quality text-to-speech (TTS) systems typically concatenate speech acoustic units or sub-units (e.g., sub-phonemes), represented by an appropriate set of parameters and stored in a database organized as a context decision tree. For close-to-natural-sounding speech, the database size (footprint) is typically very large.
In mobile applications a large footprint is prohibitive, and there have been efforts in recent years to drastically reduce the footprint with minimal reduction in synthesized speech quality. One such approach, used by IBM-HRL, is to pre-select the most frequently used sub-units in the decision tree. This approach suffers from spectral discontinuities when needed acoustic units are not available in the reduced database. Other works apply statistical models that are learned from the speech database and have a small footprint, but suffer from over-smoothing that results in muffled and buzzy speech. To improve the quality of concatenated TTS (CTTS) systems with a reduced footprint obtained by pre-selection, we considered the following two approaches: (i) combining the advantages of CTTS with those of statistical TTS (STTS) by constructing a hybrid TTS (HTTS) system, in which natural and model-based (statistical) segments are optimally (in terms of a suitable cost function) interleaved within each utterance; (ii) improving the representation of the speech data and its storage efficiency, via compression, so that for the same footprint size, more and better-represented speech sub-units can be included.
This talk will mainly describe the development of the proposed HTTS system, including novel improvements in its STTS part. The second approach is presently in its early stages, so only the main directions we plan to follow will be outlined.
The two approaches for improving small footprint TTS systems to be presented in this talk are under development by Stas Tiomkin and Tamar Shoham, respectively, in the framework of their MSc research, under the supervision of the speaker, and in cooperation with the IBM HRL speech technologies group, managed by Ron Hoory.
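A minimal sketch of how the cost-based interleaving of natural and statistical segments in approach (i) could be formulated (an illustrative dynamic program, not the actual HTTS implementation): each segment position offers several candidates, natural units plus a model-generated one, and the lowest-cost sequence under target and concatenation costs is found with a Viterbi search.

```python
import numpy as np

def select_units(candidates, target_cost, concat_cost):
    """Viterbi search over per-segment candidates.

    candidates:  list of lists; candidates[i] are the options for segment i
                 (e.g., natural units plus one statistical, model-generated unit)
    target_cost: function(candidate, segment_index) -> float
    concat_cost: function(prev_candidate, candidate) -> float
    Returns the index of the chosen candidate for each segment."""
    best = [np.array([target_cost(c, 0) for c in candidates[0]])]
    back = []
    for i in range(1, len(candidates)):
        costs = np.empty(len(candidates[i]))
        ptrs = np.empty(len(candidates[i]), dtype=int)
        for j, c in enumerate(candidates[i]):
            trans = best[-1] + np.array([concat_cost(p, c) for p in candidates[i - 1]])
            ptrs[j] = int(np.argmin(trans))
            costs[j] = trans[ptrs[j]] + target_cost(c, i)
        best.append(costs)
        back.append(ptrs)
    # backtrack the lowest-cost path
    path = [int(np.argmin(best[-1]))]
    for ptrs in reversed(back):
        path.append(int(ptrs[path[-1]]))
    return list(reversed(path))

# toy usage: three segments, each with one natural unit and one statistical unit;
# the dummy costs prefer natural units but let statistical units join smoothly
cands = [["nat_a", "stat"], ["nat_b", "stat"], ["nat_c", "stat"]]
tc = lambda c, i: 0.0 if c.startswith("nat") else 0.5
cc = lambda p, c: 0.0 if "stat" in (p, c) else 0.3
print(select_units(cands, tc, cc))
```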
Keynote. Superhuman Speech Recognition: Technology Challenges and Market Adoption
David Nahamoo, IBM Fellow, Speech CTO and Business Strategist, IBM Watson Research Center
Speech recognition has come a long way since the early days of small vocabulary prototypes 50 years ago. Techniques based on human expertise have given way to machine-trainable, data-driven approaches. Championed by IBM Research in the early 1970s, Hidden Markov Models, N-Gram Language Models, Maximum Likelihood and Discriminative training techniques are now widely adopted. Even with successful deployment of the technology in the automotive and contact center self-service markets, the technology challenges of automatic speech recognition are far from solved. We are still faced with challenges such as robustness to noise, accent, and dialect variations, as well as robustness to the speech quality variation of spontaneous speech, such as in a television interview program or a contact center conversation. Having fallen short of solving the problem raises a few fundamental questions: When can we expect to surpass human performance? Do we need to innovate fundamentally different approaches to the problem? When will the technology reach the point of diminishing returns on investment in performance improvement? In this talk, we will address the progress, the challenges, and the market adoption of speech recognition, with emphasis on transcription and analytics.
Intra-class Variability Modeling for Speech Processing
Hagai Aronowitz, IBM Haifa Research Lab
Most speech-recognition-related tasks may be formulated as classification tasks. Until recently, the state-of-the-art approach to speech classification was based on the assumption that an observed segment is a sequence of short frames drawn independently from a class-dependent distribution, modeled by a Gaussian mixture model.
Our approach is to represent every speech segment as a distribution over the frame-space, and to model every class as a prior distribution over the segment-distribution-space.
We describe how this approach leads to improved speaker recognition and speaker diarization algorithms.
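One concrete instantiation of this segment-as-a-distribution view (a sketch under simplifying assumptions, not necessarily the model used in the talk) represents each segment by the MAP-adapted means of a universal background GMM stacked into a supervector, and models each class as a Gaussian prior over that supervector space:

```python
import numpy as np

def segment_supervector(ubm_means, frames, responsibilities, r=16.0):
    """Represent one speech segment as a point in supervector space by
    MAP-adapting the UBM means to the segment's frames and stacking them.

    ubm_means:        (K, d) UBM component means
    frames:           (T, d) acoustic frames of the segment
    responsibilities: (T, K) posterior probability of each UBM component per frame
    r:                relevance factor controlling the adaptation strength"""
    n_k = responsibilities.sum(axis=0)                 # soft counts per component
    f_k = responsibilities.T @ frames                  # first-order statistics
    alpha = (n_k / (n_k + r))[:, None]                 # adaptation coefficients
    adapted = alpha * (f_k / np.maximum(n_k[:, None], 1e-8)) + (1.0 - alpha) * ubm_means
    return adapted.ravel()

def fit_class_prior(supervectors):
    """Model a class as a (diagonal-covariance) Gaussian prior over segment supervectors."""
    return supervectors.mean(axis=0), supervectors.var(axis=0) + 1e-6

def class_log_likelihood(sv, mu, var):
    """Score a test-segment supervector against a class prior."""
    return -0.5 * np.sum((sv - mu) ** 2 / var + np.log(2.0 * np.pi * var))
```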
Retrieving Spoken Information by Combining Multiple Speech Transcription Methods
Jonathan Mamou, IBM Haifa Research Lab
The rapidly increasing amount of spoken data such as broadcast news, telephone and contact center conversations, and roundtable meetings, calls for solutions to index and search this data.
The classical approach consists of converting the speech to word transcripts using large vocabulary continuous speech recognition tools and extending classical Information Retrieval techniques to word transcripts. A significant drawback of this approach is that search on queries containing Out-Of-Vocabulary (OOV) terms will not return any result. An approach for solving the OOV issue consists of converting the speech to subword (phones, syllables, or word-fragments) transcripts. The retrieval is based on searching for the sequence of subwords representing the query in the subword transcripts. However, such a subword approach suffers from low accuracy.
In this talk, we present a novel method for vocabulary independent retrieval by merging search results obtained from search on word and phonetic transcripts. The word and phonetic transcripts are both indexed and combined during the query processing.
For in-vocabulary query terms, our system uses word confusion networks (WCNs) generated by the speech recognizer. By taking into consideration the word alternatives provided by the WCNs and the terms' confidence levels, our system is able to improve search effectiveness. We also show improvement in phonetic retrieval by fuzzy search using a fail-fast edit distance computation and by phonetic query expansion.
This approach, which combines word and phonetic transcripts, is guaranteed to outperform approaches that use only a word index or only a phonetic index.
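To illustrate the fail-fast edit distance mentioned above (a generic sketch, not the system's actual implementation), the distance between the query's phone sequence and a candidate phone sequence is computed row by row and abandoned as soon as every entry in a row already exceeds the allowed threshold:

```python
def failfast_edit_distance(query_phones, cand_phones, max_dist):
    """Levenshtein distance between phone sequences that aborts ('fails fast')
    once the distance is guaranteed to exceed max_dist; returns None on abort."""
    prev = list(range(len(cand_phones) + 1))
    for i, q in enumerate(query_phones, start=1):
        curr = [i]
        for j, c in enumerate(cand_phones, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (q != c)))  # substitution or match
        if min(curr) > max_dist:
            return None                               # no match within the threshold
        prev = curr
    return prev[-1] if prev[-1] <= max_dist else None

# toy usage: the query phone sequence matches the candidate within one edit
print(failfast_edit_distance("S P IY CH".split(), "S B IY CH".split(), max_dist=1))
```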
Poster Abstracts
Improvements and Generalizations of the SVM Re-scoring Algorithm of Continuous HMMs
Amir Alfandary and David Burshtein (Tel-Aviv University)
The support vector machine (SVM) re-scoring algorithm for hidden Markov models (HMMs), recently proposed by Sloin and Burshtein, is extended in several respects. First, extended variable-to-fixed-length data transformations are proposed. Second, the algorithm is extended to continuous speech recognition. Third, further connections are established with the Fisher kernel.
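One standard variable-to-fixed-length transformation connected to the Fisher kernel is the Fisher score of a GMM: the gradient of the log-likelihood of the frame sequence with respect to the component means. The sketch below shows that mapping for a diagonal-covariance GMM, purely as an illustration of the idea rather than the transformations proposed in this work.

```python
import numpy as np

def fisher_score(frames, means, variances, weights):
    """Map a variable-length frame sequence (T, d) to a fixed-length vector:
    the gradient of the GMM log-likelihood with respect to the component means
    (diagonal-covariance GMM), normalized by the number of frames."""
    diff = frames[:, None, :] - means[None, :, :]                    # (T, K, d)
    log_p = (-0.5 * np.sum(diff ** 2 / variances, axis=2)
             - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
             + np.log(weights))                                      # (T, K)
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)                        # responsibilities
    grad = np.sum(gamma[:, :, None] * diff / variances[None, :, :], axis=0)
    return grad.ravel() / len(frames)
```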
Real Time Translation Services
Shay Ben-David (IBM Haifa Research Lab)
Real Time Translation Services (RTTS) enables two-way, free-form speech translation that assists human communication between people who do not share a common language. It may run on an embedded device such as a PDA, or as a service accessed from a communication device (e.g., a smartphone or a regular phone). The user speaks into a microphone that interfaces with RTTS. RTTS uses large vocabulary continuous speech recognition to convert the speech to a textual representation, which is translated into the target language. The translated text is then synthesized to speech for the foreign-language speaker to hear. RTTS deals with many technological challenges in speech recognition, translation, and text-to-speech. High accuracy is achieved by constraining the subject of the conversation to specific domains, such as tourism and travel.
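The chain described above can be summarized schematically as follows; the asr, translate, and tts callables are hypothetical placeholders, not the actual RTTS interfaces.

```python
def speech_to_speech(audio_in, src_lang, tgt_lang, asr, translate, tts):
    """Schematic two-way speech translation chain: ASR -> MT -> TTS.
    The asr, translate, and tts callables are placeholders supplied by the caller."""
    text = asr(audio_in, lang=src_lang)                 # large-vocabulary ASR to text
    translated = translate(text, src_lang, tgt_lang)    # machine translation
    return tts(translated, lang=tgt_lang)               # synthesis in the target language
```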
Persay Submission for NIST Speaker Recognition Evaluation 2008
Ran Gazit, Nir Krause and Gennady Karvitsky (Persay)
This poster describes PerSay's system for the NIST 2008 speaker recognition evaluation. The system achieved state-of-the-art results on telephone speech. PerSay's system consists of an SVM classifier running on Gaussian supervectors, which first go through nuisance attribute projection (NAP). We will explain the algorithms used and show the results achieved.
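A bare-bones sketch of nuisance attribute projection (a generic illustration, not PerSay's implementation): the dominant within-speaker directions of the supervectors are estimated and removed by projecting onto their orthogonal complement.

```python
import numpy as np

def nap_projection(supervectors, speaker_ids, corank=10):
    """Estimate a nuisance subspace from within-speaker (session) variability and
    return a function that removes it, i.e., applies P = I - U U^T implicitly."""
    centered = []
    for spk in set(speaker_ids):
        svs = supervectors[np.asarray(speaker_ids) == spk]
        centered.append(svs - svs.mean(axis=0))          # isolate session variability
    _, _, vt = np.linalg.svd(np.vstack(centered), full_matrices=False)
    U = vt[:corank].T                                     # dominant nuisance directions
    return lambda s: s - U @ (U.T @ s)

# toy usage: 5 'speakers' with 4 sessions each, 50-dimensional supervectors
rng = np.random.default_rng(0)
svs = rng.normal(size=(20, 50))
ids = [i // 4 for i in range(20)]
project = nap_projection(svs, ids, corank=3)
print(project(svs[0]).shape)
```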
Hybrid Approach for Age Detection
Ron M Hecht, Ruth Aloni-Lavi, Gil Dobry, Amir Alfandary, Navot Akiva and Ido Abramovich (Pudding Media)
The human speech production mechanism is a four-stage process. It begins with a message formulation stage, in which the notion one wants to transmit is materialized into a sequence of words. This stage is followed by the language code stage, which adds prosody and converts the sequence of words into a sequence of phonemes.
The last two stages are the neuro-muscular control stage and the vocal tract system stage; their output is an acoustic waveform. The age of the speaker affects each of these stages. In our work, we applied a hybrid approach to age estimation that exploits some of these effects. In particular, we focused on two main outputs: the acoustic signal and the word sequence, and estimated the speaker's age from each of these sources.
In addition, we merged the two approaches into a hybrid approach and thus achieved more accurate estimates.
Gaussian Information Bottleneck Method for Speaker Recognition
Ron M Hecht (Tel-Aviv University), Elad Noor (Weizmann Institute) and Naftali Tishby (Hebrew University)
We explore a novel approach to the extraction of relevant information for speaker recognition, using a principled information-theoretic framework: the information bottleneck (IB) method. The goal of this approach is to preserve mostly the information in the speech signal that is relevant to the speaker's identity. In this work, we focus on a specific case of IB called the Gaussian Information Bottleneck (GIB), which assumes that both the source and target variables are high-dimensional multivariate Gaussians. In addition, we explored the use of Linear Discriminant Analysis (LDA) for this task and showed that it is a special case of the GIB formulation. The GIB information curve enables us to quantify the tradeoff between relevant and irrelevant information, which is a measure of the difficulty of the classification task. By applying GIB to the speaker recognition task, using the NIST SRE dataset, we were able to significantly boost recognition performance. Our baseline system was a 128-Gaussian TUP/supervector system based on warped MFCC features, with an EER of 14.9%. Applying GIB to that system decreased the EER to 6.5% on the same task.
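For Gaussian variables the IB projection has a known closed form: its rows are (left) eigenvectors of Sigma_{x|y} Sigma_x^{-1}, and an eigenvector enters the projection once the tradeoff parameter beta exceeds 1/(1-lambda). The sketch below computes that projection, omitting the optimal per-component scaling; it illustrates the construction rather than the system evaluated in this work.

```python
import numpy as np

def gib_projection(sigma_x, sigma_x_given_y, beta):
    """Gaussian Information Bottleneck projection (unscaled sketch).

    Returns projection rows: eigenvectors of Sigma_{x|y} Sigma_x^{-1} whose
    eigenvalues satisfy lambda < 1 - 1/beta, ordered from most to least
    informative (small lambda = direction well predicted by the target variable)."""
    M = sigma_x_given_y @ np.linalg.inv(sigma_x)
    eigvals, eigvecs = np.linalg.eig(M.T)          # columns are left eigenvectors of M
    eigvals, eigvecs = eigvals.real, eigvecs.real
    order = np.argsort(eigvals)
    keep = [i for i in order if eigvals[i] < 1.0 - 1.0 / beta]
    return eigvecs[:, keep].T                      # each row projects a feature vector
```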
What Does it Feel Like to Lose? Emotions Elicited in a Voice Operated Gambling Game
Noam Amir (Tel Aviv University)
We have designed a novel computer-controlled environment that elicits emotions in subjects while they utter short, identical phrases. The paradigm is based on Damasio's experiment for eliciting apprehension and is implemented in a voice-activated computer game. For six subjects, we obtained recordings of dozens of identical sentences, coupled to events in the game: gain or loss of points. Prosodic features of the recorded utterances were extracted and classified. The resulting classifier achieved 78-85% recognition of the presence or absence of apprehension.
Embedded TTS Development at IBM Haifa Research Lab
Zvi Kons, Slava Shechtman, Yael Erez, and Asaf Rendel (IBM Haifa Research Lab)
We present three different embedded text-to-speech systems, all developed at the IBM Haifa Research Lab. Each of these systems addresses different requirements of quality and footprint. The first two are concatenative systems with two different speech codecs, addressing small (5MB) to medium (300MB) footprints. The third is a statistical TTS system (also known as HMM TTS), which is still in the research stage; its footprint will be very small (<5MB).
Reverberation Matching for Speaker Recognition
Itay Peer, Boaz Rafaely, and Yaniv Zigel (Ben-Gurion University)
Speech recorded by a distant microphone in a room may be subject to reverberation. The performance of a speaker verification system may degrade significantly for reverberant speech, with severe consequences in a wide range of real applications. This work presents a comprehensive study of the effect of reverberation on speaker verification and investigates approaches to reduce it: training target models with reverberant speech signals and using acoustically matched models for the reverberant speech under test; score normalization methods to improve reverberation robustness; and reverberation classification via the background model scores. An experimental investigation using simulated and measured room impulse responses, a NIST-based speech database, and an AGMM-based speaker verification system shows significant improvement in performance.
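A minimal sketch of how acoustically matched training data for such models is commonly generated (an illustration of the general technique, not the exact procedure in this work): clean training speech is convolved with a measured or simulated room impulse response.

```python
import numpy as np

def make_reverberant(clean, rir):
    """Create matched-condition training data by convolving clean speech with a
    (measured or simulated) room impulse response, then normalizing the level."""
    rev = np.convolve(clean, rir)[:len(clean)]
    return rev / (np.max(np.abs(rev)) + 1e-12)

# toy usage: synthetic exponentially decaying impulse response with T60 = 0.5 s
fs = 16000
t60 = 0.5
rng = np.random.default_rng(0)
rir = rng.normal(size=fs // 2) * np.exp(-3.0 * np.log(10) * np.arange(fs // 2) / (t60 * fs))
clean = rng.normal(size=fs)          # stand-in for one second of clean speech
print(make_reverberant(clean, rir).shape)
```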
The Effect of GMM Order and CMS on Speaker Recognition with Reverberant Speech
Noam R. Shabtai, Yaniv Zigel, and Boaz Rafaely (Ben-Gurion University)
Speaker recognition is used today in a wide range of applications. The presence of reverberation, in hands-free systems for example, results in performance degradation.
The effect of reverberation on the feature vectors and its relation to the optimal GMM order are investigated. The optimal model order is calculated in terms of minimum Bayesian information criterion (BIC) and Kullback information criterion (KIC), and tested against the EER of a GMM-based speaker recognition system. Experimental results show that for high reverberation times, reducing the model order reduces the EER of speaker recognition.
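A sketch of the model-order selection idea using the Bayesian information criterion (the poster also considers KIC, omitted here); the candidate orders and covariance type are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm_order(frames, orders=(4, 8, 16, 32, 64)):
    """Fit GMMs of increasing order to the feature frames and pick the order
    with the minimum Bayesian information criterion (BIC)."""
    bics = {}
    for k in orders:
        gmm = GaussianMixture(n_components=k, covariance_type='diag',
                              random_state=0).fit(frames)
        bics[k] = gmm.bic(frames)
    return min(bics, key=bics.get), bics

# toy usage on random 'MFCC-like' frames
rng = np.random.default_rng(0)
frames = rng.normal(size=(2000, 13))
best_k, scores = select_gmm_order(frames)
print("order with minimum BIC:", best_k)
```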
The effect of cepstral mean subtraction (CMS) on state-of-the-art GMM- and AGMM-based speaker recognition systems is investigated for reverberant speech. Results show that a high reverberation time reduces the effectiveness of CMS.
Enhancing Speaker Recognition with Virtual Examples
Yosef Solewicz (Israeli National Police) and Hagai Aronowitz (IBM Haifa Research Lab)
Support vector machines (SVMs) combined with Gaussian mixture models (GMMs) using universal background models (UBMs) have recently emerged as the state-of-the-art approach to speaker recognition. Typically, linear-kernel SVMs are defined in a space in which speakers are represented by supervectors. A supervector is formed by stacking the Maximum-A-Posteriori (MAP) adapted means of the UBM, given the speaker data, so that a whole speaker conversation is condensed into a single point in the supervector space. Due to the limited amount of target data, as opposed to the abundant impostor data, this framework leads to highly imbalanced training. Virtual examples (VEs) are artificial examples generated from the original labeled ones; they are one of the proposed solutions for alleviating the imbalanced-training problem and have been successfully applied in tasks such as text and handwriting recognition. In this work, we present preliminary results obtained using VEs in the context of 4-wire and 2-wire speaker recognition.
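A generic illustration of the virtual-example idea (oversampling the scarce target class with perturbed copies of its supervectors); this is a sketch of the concept, not the specific construction studied in this work.

```python
import numpy as np

def add_virtual_examples(target_svs, n_virtual=10, noise_scale=0.05, seed=0):
    """Balance SVM training by oversampling the scarce target class: create
    virtual target supervectors as small perturbations of the real ones."""
    rng = np.random.default_rng(seed)
    scale = noise_scale * target_svs.std(axis=0, keepdims=True)
    picks = rng.integers(len(target_svs), size=n_virtual)
    virtual = target_svs[picks] + rng.normal(size=(n_virtual, target_svs.shape[1])) * scale
    return np.vstack([target_svs, virtual])

# toy usage: two real target supervectors expanded to twelve training examples
rng = np.random.default_rng(1)
target = rng.normal(size=(2, 100))
print(add_virtual_examples(target).shape)
```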
Speech Acquisition and Processing in HERMES EU FP7 Project
Alex Sorin, Hagai Aronowitz, and Jonathan Mamou (IBM Haifa Research Lab)
The HERMES project is aimed at developing a personal assistive system that alleviates aging-related cognitive decline by providing "external memory" and cognitive training based on the recording and analysis of audio/visual information. The system acquires information automatically and semi-automatically through a combination of an in-home audio/video sensing infrastructure and a mobile device. The spectrum of services provided to the user includes exploring past experience, time/location/situation-dependent reminders, and cognitive training. The expertise of the HERMES consortium partners spans gerontology, user experience and HCI, audio/video processing, data mining, mobile applications, and SW engineering. IBM is responsible for the speech processing, including speaker recognition, speech transcription, speech-based emotion recognition, and spoken information retrieval. Elderly speech and far-field audio recording in a variable and uncontrolled acoustic environment pose significant challenges to the speech processing tasks. Capabilities stemming from the personal nature of the HERMES system will be explored to address these challenges; these include speaker-specific modeling and automatic unsupervised adaptation based on the data progressively accumulated by the system.
Iterative Learning & Applications to Meta Data Extraction
Roman Talyansky and Alon Itai (Technion)
We present Iterative Learning, a machine learning approach that we developed to solve problems in the natural language processing domain. First, Iterative Learning is defined; then two of its applications are discussed: Meta Data Extraction in speech and Named Entity Recognition in text. Within Meta Data Extraction, we use only prosodic features to extract prosodic phenomena such as speech unit boundaries, interruption points, filled pauses, and discourse markers. We show that given a baseline solution to the Meta Data Extraction or Named Entity Recognition problems, applying Iterative Learning reduces the relative error of the baseline solution by 30%.