On the importance of event detection for ASR
The performance of modern large vocabulary continuous speech recognition (LVCSR) systems depends heavily on accurate segment boundaries, correct speaker identification for each segment, and the removal of spurious data. We propose using Long Short-Term Memory (LSTM) recurrent neural networks to partition audio into speech segments and to track speaker turns. We additionally train an LSTM to identify music segments. We show that accurate event detection with our LSTM, combined with the removal of silence and music, yields a 9-10% relative improvement in ASR performance. Secondary processing by speaker clustering provides a further gain in accuracy. We also report the event detection accuracy of the LSTM approach.
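As a rough illustration of the post-processing implied above, the following minimal sketch collapses per-frame event labels into segments and keeps only the speech segments for the recognizer. The label names ("speech", "music", "silence"), the 10 ms frame shift, and all function names are assumptions made for this example, not details taken from the system described here. The last helper shows how a relative improvement is computed from two error rates, assuming the metric is an error rate such as WER where lower is better.

```python
from itertools import groupby

# Assumed frame shift of 10 ms; the source does not state the frame rate.
FRAME_SHIFT_S = 0.01


def frames_to_segments(labels, frame_shift=FRAME_SHIFT_S):
    """Collapse runs of identical per-frame labels into (label, start, end) tuples.

    `labels` is a sequence of hypothetical per-frame class names, for example
    the argmax of a frame-level classifier's output.
    """
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        start = index * frame_shift
        end = (index + length) * frame_shift
        segments.append((label, start, end))
        index += length
    return segments


def speech_only(segments, speech_label="speech"):
    """Drop non-speech segments (e.g. silence and music) before recognition."""
    return [seg for seg in segments if seg[0] == speech_label]


def relative_improvement(baseline_error, new_error):
    """Relative reduction in an error rate, as a fraction of the baseline."""
    return (baseline_error - new_error) / baseline_error
```

In this sketch, a call such as `speech_only(frames_to_segments(labels))` would give the time spans passed on to the recognizer. Speaker turn tracking and speaker clustering are not modeled here.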