Publication
INTERSPEECH - Eurospeech 2003
Conference paper
Impact of audio segmentation and segment clustering on automated transcription accuracy of large spoken archives
Abstract
This paper addresses the influence of audio segmentation and segment clustering on automatic transcription accuracy for large spoken archives. The work forms part of the ongoing MALACH project, which is developing advanced techniques for supporting access to the world's largest digital archive of video oral histories, collected in many languages from over 52,000 survivors and witnesses of the Holocaust. We present several audio-only and audio-visual segmentation schemes, including two novel schemes: the first is iterative and audio-only; the second uses audio-visual synchrony. Unlike most previous work, we evaluate these schemes in terms of their impact on recognition accuracy. Results on English interviews show that the automatic segmentation schemes give performance comparable to (exorbitantly expensive and impractically lengthy) manual segmentation when using a single-pass decoding strategy based on speaker-independent models. However, when using a multiple-pass decoding strategy with adaptation, results are sensitive to both the initial audio segmentation and the scheme for clustering segments prior to adaptation: the combination of our best automatic segmentation and clustering scheme has an error rate 8% worse (relative) than manual audio segmentation and clustering, due to the occurrence of "speaker-impure" segments.