IBM Haifa Labs Leadership Seminars

Speech Technology Seminar 2003


Abstracts
State of the Art Large Vocabulary Speech Recognition: The MALACH Project (Abstract)
Michael Picheny, IBM T.J. Watson Research Center

The principal goal of the Multilingual Access to Large spoken ArCHives (MALACH) project is to develop methods for dramatically improved access to large multilingual spoken archives. Our primary focus is the enormous multimedia digital archive of the Survivors of the Shoah Visual History Foundation (VHF). This archive contains over 116,000 hours of interviews with over 52,000 survivors, liberators, rescuers, and witnesses of the Shoah, recorded in 32 languages. Four thousand of the interviews in English have been manually cataloged at great expense, but this represents only a small fraction of the total amount of information in the archive. Automating the cataloging process is the only practical way to provide access to such an archive at reasonable cost. However, today's automatic technologies have relatively limited capabilities, and these must be dramatically enhanced if the full potential of digital archiving is to be realized. In this project, we seek to make just such a leap. The research challenges include enhancing current transcription technology to robustly handle the emotional, heavily accented speech of elderly speakers in multiple languages, and developing automated indexing, cataloging, and retrieval methods that can cope with relatively unstructured interviews and stream-of-consciousness narratives. MALACH is an NSF-funded joint effort with the VHF, Johns Hopkins University, and the University of Maryland. IBM's role in the project is to develop robust speech recognition and information retrieval technologies for English. We will present a more detailed overview of MALACH, state-of-the-art speech recognition and information retrieval results, and an analysis of the technical challenges ahead. We will also describe the background and status of IBM's research into "Superhuman" speech recognition, an attempt to create robust speech recognition systems across different domains, channels, and environments.


Embedded Speech Recognition for Mass-market Mobile Devices (Abstract)
Eran Aharonson, CEO, ART Advanced Recognition Technologies Inc.

This talk describes issues concerning the integration of advanced speech recognition into mass-market mobile devices such as cellular handsets, communicators, smartphones, and automotive telematics equipment. The huge cellular handset market, with over 400 million units sold each year, has the greatest potential for speech recognition use (according to a 2002 Kelsey Group report); yet in order to integrate advanced speech recognition into mass-market handsets, manufacturers must cope with a variety of difficulties that published research has not necessarily covered. The main challenge is integrating a high-quality speech solution into a low-cost device such as a cellular handset. This challenge encompasses the following dimensions, among others:
  • The device's battery life, which limits the available computation power.
  • The device's price, which dictates the use of minimal memory resources and low-cost parts (such as microphones).
  • The environment, which is usually noisy, such as a street or moving car.
  • The speech recognition front end, which must utilize CELP input, as many cellular basebands do not provide access to the original PCM.
  • The voice user interface, which must combine voice interaction and a graphical user interface into one working, user-friendly UI.
We will present examples such as the Xelibri 3 keypadless handset from Siemens. This handset recognizes multiple languages and combines continuous speaker-independent and speaker-dependent technologies to provide a fully voice-operated user interface.


Concatenative Text-to-Speech for the Embedded Environment (Abstract)
Ron Hoory, IBM Haifa Research Lab

In recent years, major progress has been made in the quality and naturalness of text-to-speech (TTS) systems. Today, high-quality speech can be produced by concatenative synthesis systems, in which speech segments are selected from a large speech database and concatenated. To obtain high-quality synthesized speech, a large amount of speech data is required: the database must cover as many phonetic contexts and acoustic environments as possible in order to avoid discontinuities at the concatenation points and to reduce the amount of prosodic modification. The resulting size, which can reach 500 MB, is prohibitive for embedded environments.
This talk will describe IBM's concatenative text-to-speech system and the work done to reduce its footprint and make it suitable for embedded environments. The approach uses a spectral acoustic feature-based speech representation both for computing a cost function during segment selection and for speech generation.
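As a rough illustration of the segment-selection step, the sketch below scores candidate segments with a target cost (distance to the desired spectral features) plus a concatenation cost (spectral mismatch between adjacent segments) and picks the cheapest sequence by dynamic programming. The cost functions, weights, and feature representation are generic stand-ins; the actual costs and features used in the IBM system are not detailed in this abstract.

```python
# Minimal unit-selection sketch: choose one candidate segment per slot so that
# the total target + concatenation cost is minimal (Viterbi-style DP).
import numpy as np

def select_units(candidates, target_feats, w_target=1.0, w_concat=1.0):
    """candidates[i]: list of (unit_id, feature_vector) options for slot i.
    target_feats[i]: desired spectral feature vector for slot i (numpy arrays).
    Returns the unit_id sequence minimizing the total cost."""
    n = len(candidates)
    # cost[i][j] = best cumulative cost ending with candidate j at slot i
    cost = [np.full(len(c), np.inf) for c in candidates]
    back = [np.zeros(len(c), dtype=int) for c in candidates]
    for j, (_, feat) in enumerate(candidates[0]):
        cost[0][j] = w_target * np.linalg.norm(feat - target_feats[0])
    for i in range(1, n):
        for j, (_, feat) in enumerate(candidates[i]):
            t_cost = w_target * np.linalg.norm(feat - target_feats[i])
            # concatenation cost: spectral mismatch to each previous candidate
            c_costs = [w_concat * np.linalg.norm(feat - prev_feat)
                       for (_, prev_feat) in candidates[i - 1]]
            total = cost[i - 1] + np.array(c_costs) + t_cost
            back[i][j] = int(np.argmin(total))
            cost[i][j] = total[back[i][j]]
    # trace back the cheapest path
    j = int(np.argmin(cost[-1]))
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j][0] for i, j in enumerate(path)]
```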


Distributed Speech Recognition - Standardization Activity (Abstract)
Alex Sorin, Haifa Research Labs

The low bit rate coding of speech signals negatively affects the accuracy of automatic speech recognition. This imposes a limitation on the usability of speech-enabled services, such as interactive voice response (IVR) services, accessed over mobile networks. Distributed Speech Recognition (DSR) technology overcomes this problem by extracting recognition features on the mobile client device and transmitting them in compressed form to a speech recognition server. Recognition accuracy is then practically unaffected by the feature coding and transmission. The need to ensure compatibility between the client-side feature extraction and the server-side recognition engine back end led to DSR standardization. In 2000 and 2002, the European Telecommunications Standards Institute (ETSI) standardized two DSR codecs.
Unlike conventional speech codecs, the ETSI DSR codecs are unable to reconstruct the speech signal, yet playback capability at the server side is a vitally important function for some applications. IBM Haifa Research Labs and Motorola Labs have jointly developed and proposed to ETSI an extension of the DSR standards. This technology provides the capability of speech reconstruction from extended DSR features and enhances the recognition of tonal languages, e.g., Mandarin. The extended DSR standard proposals are undergoing the formal approval process in ETSI.

We will present the algorithmic aspects of the extended DSR technology and the results of its evaluation.
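A simplified sketch of the client side of a DSR system is shown below: cepstral-like features are extracted locally and quantized before transmission, and the server dequantizes them for recognition. The ETSI DSR front ends specify particular mel filterbanks and split-VQ codebooks; the plain uniform quantizer, the omission of mel warping, and the frame parameters here are illustrative assumptions only.

```python
# Sketch of client-side feature extraction and compression for DSR.
import numpy as np
from scipy.fftpack import dct

def frame_signal(x, frame_len=200, hop=80):            # 25 ms / 10 ms at 8 kHz
    n = (len(x) - frame_len) // hop + 1                # assumes len(x) >= frame_len
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def cepstral_features(frames, n_ceps=13):
    window = np.hamming(frames.shape[1])
    spectra = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    log_spec = np.log(spectra + 1e-10)
    # cepstrum via DCT of the log spectrum (mel warping omitted for brevity)
    return dct(log_spec, type=2, axis=1, norm='ortho')[:, :n_ceps]

def quantize(features, bits=6):
    # uniform scalar quantization per coefficient; the standard uses split VQ
    lo, hi = features.min(axis=0), features.max(axis=0)
    step = (hi - lo) / (2 ** bits - 1)
    codes = np.round((features - lo) / np.maximum(step, 1e-12)).astype(np.uint8)
    return codes, lo, step                             # transmit codes + header

def dequantize(codes, lo, step):
    return codes * step + lo                           # server-side reconstruction
```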


Challenges in Speaker-Independent Name Dialing for Cell-Phones (Abstract)
Adoram Erell, Speech Research Lab, Intel Wireless Communication and Computing Group

Many of today's cell-phones have a speaker-dependent voice-tags feature, which works well for those who take the trouble to voice-enroll a few phone-book entries and do not forget which names they have enrolled. It would be much more convenient to have a speaker-independent name-dialing feature, where any name in the phone book can be seamlessly voice-activated without pre-enrollment. In this talk, we describe the challenges we faced while developing such a technology for Intel cellular processors.


Speaker Verification with Optimal Feature Space (Abstract)
Arnon Cohen, Ben-Gurion University

Speaker verification and identification systems most often employ HMMs and GMMs as recognition engines. This paper describes an algorithm for the optimal selection of the feature space suitable for these engines. In verification systems, each speaker (target) is assigned an "individual" optimal feature space in which he/she is best discriminated from impostors. A Dynamic Programming (DP) algorithm is used for the selection process. Several suitable criteria, correlated with the recognition error, were developed and evaluated. The procedure allows the verification system to be optimized for various applications, such as security systems or "convenience" systems. The algorithm was evaluated on a text-dependent database. A significant improvement in verification results was demonstrated with the DP-selected individual feature space. An EER of 4.8% was achieved when the feature set was the "almost standard" Mel Frequency Cepstrum Coefficients (MFCC) space (12 MFCC + 12 delta MFCC). Under the same conditions, a system based on the selected feature space yielded an EER of only 2.7%.
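The sketch below illustrates the idea of an "individual" feature space: for each target speaker, select the coefficients that best separate that speaker from impostor data. It uses a simple per-coefficient Fisher ratio in place of the dynamic-programming selection and error-correlated criteria described in the talk, so it should be read as a conceptual simplification rather than the actual algorithm.

```python
# Per-speaker feature-subspace selection (simplified stand-in for DP selection).
import numpy as np

def fisher_ratio(target_frames, impostor_frames):
    """Per-dimension separation between target and impostor feature frames."""
    mu_t, mu_i = target_frames.mean(axis=0), impostor_frames.mean(axis=0)
    var_t, var_i = target_frames.var(axis=0), impostor_frames.var(axis=0)
    return (mu_t - mu_i) ** 2 / (var_t + var_i + 1e-10)

def select_individual_space(target_frames, impostor_frames, k=12):
    """Pick the k coefficients that best discriminate this target speaker."""
    scores = fisher_ratio(target_frames, impostor_frames)
    return np.argsort(scores)[::-1][:k]                # indices of kept features

# Usage: at enrollment, compute dims = select_individual_space(...) for each
# target, then train and score that speaker's model on frames[:, dims] only.
```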


Speaker Verification by Adapted Phoneme Gaussian Mixture Models (Abstract)
Yuval Bistritz, Tel-Aviv University

Despite intuitive expectation and experimental evidence that phonemes carry useful speaker-discriminating information, phoneme-based speaker recognition systems reported so far have not been found to perform better than phoneme-independent systems that successfully use Gaussian Mixture Models (GMMs). This talk describes new phoneme-based speaker verification systems that do bear out the expectation that a phoneme-based system should outperform a corresponding phoneme-independent one. The speaker verification system presented here models phonemes for each speaker using GMMs created by Bayesian adaptation of a phoneme-independent GMM built from the whole training data, to an extent that depends on the amount of data available for each phoneme of each speaker. The new speaker verification systems were tested on clean and telephone speech databases and consistently outperformed comparable phoneme-independent GMM-based systems in all the experiments conducted. Further improvement in performance was obtained by adapting only selected subsets of the most discriminative parameters and phonemes. This work is based on the M.Sc. thesis of Dan Gutman, supervised by Yuval Bistritz.
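A minimal sketch of the Bayesian adaptation step, in the standard GMM-UBM relevance-factor form, is given below: a phoneme-independent GMM is adapted toward one speaker's frames for a given phoneme, with the adaptation strength growing with the amount of data. The relevance factor and the restriction to mean adaptation are assumptions; the talk's exact scheme and the selected parameter subsets may differ.

```python
# MAP (Bayesian) adaptation of the means of a phoneme-independent GMM toward
# one speaker's data for a single phoneme.
import numpy as np

def map_adapt_means(weights, means, covars, frames, relevance=16.0):
    """weights (M,), means (M,D), covars (M,D) diagonal: background GMM.
    frames (N,D): this speaker's frames for one phoneme.
    Returns adapted means; adaptation strength grows with the data count."""
    M, D = means.shape
    # responsibilities of each mixture for each frame (diagonal Gaussians)
    log_p = np.empty((len(frames), M))
    for m in range(M):
        diff = frames - means[m]
        log_p[:, m] = (np.log(weights[m])
                       - 0.5 * np.sum(np.log(2 * np.pi * covars[m]))
                       - 0.5 * np.sum(diff ** 2 / covars[m], axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    n_m = gamma.sum(axis=0)                                    # soft counts
    ex_m = gamma.T @ frames / np.maximum(n_m[:, None], 1e-10)  # data means
    alpha = n_m / (n_m + relevance)                            # adaptation weight
    return alpha[:, None] * ex_m + (1 - alpha[:, None]) * means
```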


Speaker Detection in Multi-Speaker Sessions (Abstract)
Ran Gazit, Persay

Speaker detection is the task of detecting the presence of a specific speaker in a given audio session with an unlimited number of speakers. This paper describes several methods for speaker detection, including external segmentation, internal segmentation, and sliding single-speaker verification. Experimental results on various databases are presented, and the application of speaker detection in a call monitoring center is demonstrated.
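A minimal sketch of the sliding single-speaker verification approach is given below: successive windows of the session are scored against a target model and a background model, and windows whose log-likelihood ratio exceeds a threshold are attributed to the target. The window sizes, threshold, and scoring callables are placeholders.

```python
# Sliding-window speaker detection via a log-likelihood ratio test.
import numpy as np

def detect_speaker(frames, target_loglik, background_loglik,
                   win=300, hop=100, threshold=0.0):
    """frames: (N, D) feature frames of the whole multi-speaker session.
    target_loglik / background_loglik: callables returning the average
    per-frame log-likelihood of a block of frames under each model.
    Returns (start, end, score) for windows attributed to the target."""
    hits = []
    for start in range(0, max(1, len(frames) - win + 1), hop):
        block = frames[start:start + win]
        score = target_loglik(block) - background_loglik(block)   # LLR
        if score > threshold:
            hits.append((start, start + len(block), score))
    return hits
```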


Speech Coding at Very Low Bit Rates (Abstract)
David Malah ¹, Technion, Haifa

There is continued interest in reducing the bit rate of speech coders for wireless communication, Internet transmission, storage, etc. In this talk, we present our activity in recent years in developing very low bit rate speech coders (from 2000 down to 600 bps).
The basic idea pursued is that instead of fixed-length short analysis frames (matched to the typical duration of quasi-stationary speech), longer segments can be used where appropriate. This idea led to a series of three projects in which the concept is applied in three different ways.
In the first approach, a long frame of fixed length (135 msec) is divided into a set of six subframes, where only the parameters of either a part of the subframes or of merged subframes are quantized and transmitted (a total of three parameter sets). Skipped, unskipped, and merged subframes within a long frame are selected using joint quantization and segmentation, taking into account the spectral-domain interpolation errors of skipped subframes. Dynamic programming is used to minimize reconstruction errors (using a log-spectral distance measure). Using a MELP-based excitation model, a coder operating at 1200 bps, with performance close to that of the MELP-2400 coder, was developed.
In the second approach, a Long Term Model (LTM) of voiced speech is utilized. According to this model (developed earlier in our lab), long segments (100 to 160 msec) of voiced speech are warped so as to obtain a signal with a fixed pitch cycle but a time-varying waveform in each cycle. The warping function and the evolution of the waveform from one cycle to the next, using prototype waveforms (as in the WI coder), are efficiently represented and coded, resulting in a variable-rate coder operating at an average rate of 2 kbps.
In the third approach, still under development, temporal decomposition (TD) is used to segment the speech signal into variable-length segments (of up to 300 msec) according to phonetic events. The spectral-envelope information of each long segment is then represented by a small number of event vectors (line spectral frequencies, LSF) and an event function. This function is used to interpolate LSF vectors at regular intervals from the selected event vectors. The event vectors and the event function are vector quantized. A harmonic-and-noise excitation model is being considered for generating the excitation (similar to the MBE coder), again using the TD model to effectively represent and code the excitation parameters. The potential of this coder is to achieve a rate of 600 bps at quality similar to that of the 1200 bps coder discussed above.

¹ The three coders to be presented in this talk were developed by Ronen Mayrench, Orit Lev, and Slava Shechtman, respectively, in the framework of their M.Sc. research, under the supervision of the speaker.
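As a simplified illustration of the first approach above, the sketch below uses dynamic programming to choose which subframe parameter sets to keep so that the error of interpolating the skipped subframes is minimal. Subframe merging, the actual log-spectral distance, and boundary handling are omitted; the mean-squared distance is a stand-in.

```python
# DP selection of k out of n subframe spectra; skipped subframes are
# reconstructed by linear interpolation between the nearest kept neighbors.
import numpy as np

def interp_error(spectra, i, j):
    """Error of linearly interpolating the skipped subframes between kept
    subframes i and j (spectra: one row of log-magnitude values per subframe)."""
    err = 0.0
    for t in range(i + 1, j):
        w = (t - i) / (j - i)
        approx = (1 - w) * spectra[i] + w * spectra[j]
        err += float(np.mean((spectra[t] - approx) ** 2))   # stand-in distance
    return err

def select_subframes(spectra, k=3):
    """Pick k of the n subframes minimizing total interpolation error."""
    n = len(spectra)
    # dp[j, i]: min cost of keeping j+1 subframes with the last one at index i
    dp = np.full((k, n), np.inf)
    back = np.zeros((k, n), dtype=int)
    dp[0, :] = 0.0                       # a single kept subframe costs nothing
    for j in range(1, k):
        for i in range(j, n):
            for p in range(j - 1, i):
                c = dp[j - 1, p] + interp_error(spectra, p, i)
                if c < dp[j, i]:
                    dp[j, i], back[j, i] = c, p
    i = int(np.argmin(dp[k - 1]))
    kept = [i]
    for j in range(k - 1, 0, -1):
        i = back[j, i]
        kept.append(i)
    return sorted(kept)
```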


Speech Enhancement Based on the General Transfer Function GSC and Postfiltering (Abstract)
Sharon Gannot and Israel Cohen, Technion, Haifa

In speech enhancement applications, microphone array postfiltering allows additional reduction of noise components at the beamformer output. Among microphone array structures, the recently proposed General Transfer Function Generalized Sidelobe Canceller (TF-GSC) has shown impressive noise reduction ability in a directional noise field, while still maintaining low speech distortion. However, in a diffuse noise field, less significant noise reduction is obtainable, and the performance degrades even further when the noise is nonstationary. In this contribution, we present three postfiltering methods for improving the performance of microphone arrays. Two of them are based on single-channel speech enhancers, applying recently proposed algorithms to the beamformer output. The third is a multi-channel speech enhancer, which exploits noise-only components constructed within the TF-GSC structure. An experimental study, consisting of both objective and subjective evaluations in various noise fields, demonstrates the advantage of the multi-channel postfiltering over the single-channel techniques.
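For orientation, the sketch below shows a plain time-domain generalized sidelobe canceller: a fixed delay-and-sum beamformer, a blocking matrix producing noise-only references, and an NLMS adaptive noise canceller. The actual TF-GSC operates in the frequency domain using estimated relative transfer functions, and the proposed postfilters are not reproduced here.

```python
# Plain time-domain GSC sketch (fixed beamformer + blocking matrix + NLMS).
import numpy as np

def gsc(mics, mu=0.1, taps=16):
    """mics: (M, N) time-aligned microphone signals, M >= 2.
    Returns the enhanced output signal."""
    M, N = mics.shape
    fbf = mics.mean(axis=0)                   # fixed (delay-and-sum) beamformer
    refs = mics[1:] - mics[:-1]               # blocking matrix: pairwise differences
    w = np.zeros((M - 1, taps))               # adaptive filter per noise reference
    out = np.zeros(N)
    for n in range(taps, N):
        x = refs[:, n - taps:n][:, ::-1]      # recent reference samples
        y = float(np.sum(w * x))              # estimated residual noise
        e = fbf[n] - y                        # enhanced output sample
        norm = np.sum(x ** 2) + 1e-8
        w += mu * e * x / norm                # NLMS update
        out[n] = e
    return out
```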


Autodirective Dual Microphone and Other Alango DSP Technologies (Abstract)
Alexander Goldin, CEO, Alango

The presentation is devoted to Autodirective Dual Microphone Technology (ADM) and its integration with other Alango digital signal processing technologies.
Autodirective Dual Microphone is a proprietary digital signal processing technology developed at Alango Ltd. It creates an optimal directional microphone from two omnidirectional microphones. The technology is adaptive, so that the best possible signal-to-noise ratio is ensured under all conditions. The adaptation is very fast: ADM forgets its past in about 5 milliseconds. This provides exceptional performance when non-stationary interference arrives from different directions.

ADM technology also solves two main problems of traditional directional microphones: it is much less sensitive to wind noise and it does not exhibit the proximity effect. ADM has three modes of operation (a simplified sketch of the underlying adaptive dual-microphone idea follows the list):
  1. Far-talk end-fire. In this mode, ADM attenuates sounds arriving from the rear hemisphere relative to the axis connecting the two microphones.
  2. Close-talk end-fire. In this mode, ADM attenuates all sounds that originate far from the front microphone.
  3. Broadside, where the preferred direction is at a right angle to the microphone axis.
ADM provides very good output sound quality without signal distortion. The technology is suitable for mass-market as well as hi-fi applications.
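The sketch below follows the classic adaptive first-order differential array on which autodirective dual-microphone processing builds: forward- and backward-facing cardioids are formed from the two omni capsules by delay-and-subtract, and an adaptive mixing coefficient steers a null toward the dominant rear interference. Alango's ADM algorithm itself is proprietary; the sampling rate, microphone spacing, and whole-sample delay used here are illustrative assumptions.

```python
# Adaptive first-order differential dual-microphone sketch (far-talk end-fire).
import numpy as np

def adaptive_dual_mic(front, rear, fs=16000, spacing=0.025, mu=0.05, c=343.0):
    """front, rear: omni microphone signals of equal length.
    Returns the output with an adaptive rear-facing null."""
    # inter-mic propagation delay, rounded to whole samples; real systems use
    # fractional delays or subband processing
    delay = int(round(spacing / c * fs))
    f_d = np.concatenate([np.zeros(delay), front[:len(front) - delay]])
    r_d = np.concatenate([np.zeros(delay), rear[:len(rear) - delay]])
    c_fwd = front - r_d                           # forward-facing cardioid
    c_bwd = rear - f_d                            # backward-facing cardioid
    beta, out = 0.0, np.zeros(len(front))
    for n in range(len(front)):
        y = c_fwd[n] - beta * c_bwd[n]            # steerable first-order pattern
        out[n] = y
        # NLMS-style update minimizing output power, null kept in rear half-plane
        beta += mu * y * c_bwd[n] / (c_bwd[n] ** 2 + 1e-8)
        beta = min(max(beta, 0.0), 1.0)
    return out
```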
ADM technology is very well integrated with other Alango digital signal processing technologies such as Stationary Noise Suppression, Speech Harmonics Enhancement & Restoration, Multiband Dynamic Range Reduction, and Automatic Gain Control. They all share the same subband analysis/synthesis scheme as well as some auxiliary internal computations.
During the presentation, some pre-recorded off-line demo files will be played. ADM technology will also be demonstrated in a real-time demo application.
