Publication
INTERSPEECH 2012
Conference paper

Inverting the point process model for fast phonetic keyword search

Abstract

Normally, we represent speech as a long sequence of frames and model the keyword with a relatively small set of parameters, commonly with a hidden Markov model (HMM). However, since the input speech is much longer than the keyword, suppose instead that we represent the speech as a relatively sparse set of impulses (roughly one per phoneme) and model the keyword as a filter bank, where each filter's impulse response relates to the likelihood of a phone at a given position within a word. Evaluating keyword detections can then be seen as a convolution of an impulse train with an array of filters. This view enables huge speedups: runtime no longer depends on the frame rate and is instead linear in the number of events (impulses). We apply this intuition to redesign the runtime engine behind the point process model for keyword spotting. We demonstrate impressive real-time speedups (500,000x faster than real-time) with minimal loss in search accuracy.
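The convolution view in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `keyword_scores`, the event format, the toy phone labels, and the filter weights are all assumptions chosen for illustration. Each detected phone event scatters its filter's response onto candidate keyword start times, so the work scales with the number of events rather than the number of frames.

```python
from collections import defaultdict


def keyword_scores(events, filters):
    """Accumulate keyword scores by scattering each event's filter response.

    events:  iterable of (frame_index, phone_label) pairs, one per detected phone
    filters: dict mapping phone_label -> list of weights, where weights[d] is a
             hypothetical score contribution for seeing that phone d frames
             after the keyword start
    Returns a dict mapping candidate keyword start frame -> accumulated score.
    """
    scores = defaultdict(float)
    for t, phone in events:
        response = filters.get(phone)
        if response is None:
            continue  # this phone has no filter, so it cannot support the keyword
        for d, w in enumerate(response):
            if w != 0.0:
                # An event at frame t, d frames into the word, supports start t - d.
                scores[t - d] += w
    return dict(scores)


# Toy data (assumed): a sparse impulse train and a three-filter bank for "cat".
events = [(10, "k"), (14, "ae"), (19, "t"), (40, "s")]
filters = {
    "k":  [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "ae": [0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    "t":  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0],
}

scores = keyword_scores(events, filters)
best_start = max(scores, key=scores.get)
print(best_start, scores[best_start])
```

The outer loop visits each event once and the inner loop is bounded by the filter length, so the cost is independent of how many frames separate the events.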

Date

01 Dec 2012

Authors
