Conference paper
ICSLP 2002
A hybrid HMM/TRAPS model for robust voice activity detection
Abstract
We present three voice activity detection (VAD) algorithms that are suitable for the off-line processing of noisy speech and compare their performance on SPINE-2 evaluation data using speech recognition error rate as the quality metric. One VAD system is a simple HMM-based segmenter that uses normalized log-energy and a degree of voicing measure as raw features. The other two VAD systems focus on frequency-localized temporal information in the speech signal using a TempoRAl PatternS (TRAPS) classifier. They differ only in the processing of the TRAPS output. One VAD system uses median filtering to generate segment hypotheses, while the other is a hybrid system that uses a Viterbi search identical to that used in the HMM segmenter. Recognition on the hybrid HMM/TRAPS segmentation is more accurate than recognition on the other two segmentations by 1% absolute. This difference is statistically significant at a 99% confidence level according to a matched pairs sentence-segment word error test.
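To illustrate the segment-generation step used by the median-filtering VAD system, here is a minimal Python sketch (not the paper's implementation): it applies a sliding median filter to frame-level speech posteriors, thresholds them, and merges consecutive speech frames into segments. The function names, window size, and threshold are illustrative assumptions.

```python
import numpy as np

def median_smooth(posteriors, window=11):
    """Sliding median filter over frame-level speech posteriors.

    Edge frames are handled by padding with the boundary values.
    (Window size is an assumed example parameter.)
    """
    half = window // 2
    padded = np.pad(posteriors, half, mode="edge")
    return np.array([np.median(padded[i:i + window])
                     for i in range(len(posteriors))])

def posteriors_to_segments(posteriors, threshold=0.5, window=11):
    """Smooth posteriors, threshold them, and merge consecutive
    speech frames into (start_frame, end_frame) segments."""
    smoothed = median_smooth(posteriors, window)
    speech = smoothed > threshold
    segments, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i                      # segment opens
        elif not is_speech and start is not None:
            segments.append((start, i))    # segment closes
            start = None
    if start is not None:                  # speech runs to the end
        segments.append((start, len(speech)))
    return segments
```

The median filter suppresses isolated frame-level errors (a single misclassified frame inside a long speech run) before segment boundaries are drawn, which is the role the paper assigns to median filtering on the TRAPS output.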