Detecting breathing sounds in realistic Japanese telephone conversations and its application to automatic speech recognition
Non-verbal sound detection has long attracted attention in the speech analytics field. Although the detection of laughter, coughs, and lip smacking has been well studied in the literature, breath-event detection has received little attention despite the need for it. Breath events are highly correlated with major prosodic breaks, meaning that the positions of breath events can serve as utterance delimiters in combination with a voice activity detection (VAD) technique. Silence intervals approximately 20 ms long immediately before and after breathing sounds, called “edges,” are clearly observed in speech signals. Capturing these edges has been shown to be very effective in reducing false alarms in breath-event detection. However, the edges often disappear when breaths are taken in spontaneous speech. In this work, we focus on the robustness of breath-event detection in spontaneous speech. The breath detection method we have developed leverages acoustic information specialized for breathing sounds, leading to a two-step approach that detects breath events with an accuracy of 97.4%. We also propose splitting unsegmented speech signals into semantically grouped utterances by leveraging the detected breath events. Speech segmentation based on accurate breath-event detection yielded a 3.8% relative error reduction in automatic speech recognition (ASR).
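To illustrate the “edge” cue described above, the following is a minimal sketch of checking for roughly 20 ms of near-silence immediately before and after a candidate breath segment. This is not the paper's method; the function name `has_silence_edges`, the energy threshold, and the 16 kHz sample rate are illustrative assumptions.

```python
import numpy as np

def has_silence_edges(signal, start, end, sr=16000, edge_ms=20, thresh=1e-4):
    """Return True if ~edge_ms of near-silence both precedes and follows
    the candidate breath segment signal[start:end].

    Hypothetical edge check: 'near-silence' is defined here as mean squared
    amplitude below `thresh`, an assumed threshold (not from the paper).
    """
    edge = int(sr * edge_ms / 1000)          # edge length in samples
    before = signal[max(0, start - edge):start]
    after = signal[end:end + edge]
    if len(before) < edge or len(after) < edge:
        return False                          # segment too close to signal boundary
    return (np.mean(before ** 2) < thresh) and (np.mean(after ** 2) < thresh)
```

In spontaneous speech, the abstract notes, these silent edges often vanish (e.g., a breath directly abutting voiced speech would fail this check even though a breath is present), which is why a detector relying on edges alone is not robust.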