Abstract
Conventional keyword search (KWS) systems for speech databases match the input text query against the set of word hypotheses generated by an automatic speech recognition (ASR) system from the utterances in the database. Hence, such KWS systems attempt to solve the complex problem of ASR as a precursor. Training an ASR system is itself a time-consuming process that requires transcribed speech data. Our prior work presented an ASR-free, end-to-end system that needed minimal supervision and trained significantly faster than an ASR-based KWS system. The ASR-free KWS system consisted of three subsystems. The first subsystem was a recurrent neural network (RNN) based acoustic encoder that extracted a fixed-dimensional embedding of the speech utterance. The second subsystem was a query encoder that produced an embedding of the input text query. The acoustic and query embeddings were fed to a feedforward neural network that predicted whether the query occurred in the acoustic utterance. This paper extends our prior work in several ways. First, we improve upon our previous ASR-free KWS results by nearly 20% relative through improvements to the acoustic encoder. Next, we show that the acoustic encoder can be trained on languages other than the language of interest with only a small drop in KWS performance. Finally, we attempt to predict the locations of the detected keywords by training a location-sensitive KWS network.
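To make the three-subsystem architecture concrete, the following is a minimal PyTorch sketch of the design described above. It is an illustration under stated assumptions, not the paper's actual configuration: the class and parameter names (ASRFreeKWS, feat_dim, num_chars, embed_dim), the use of single-layer GRUs, the character-level query encoding, and all dimensions are hypothetical choices.

```python
import torch
import torch.nn as nn

class ASRFreeKWS(nn.Module):
    """Minimal sketch of the three-subsystem ASR-free KWS model.
    All layer choices and sizes are illustrative assumptions."""

    def __init__(self, feat_dim=40, num_chars=30, embed_dim=128):
        super().__init__()
        # Subsystem 1: RNN acoustic encoder; its final hidden state
        # serves as a fixed-dimensional embedding of the utterance.
        self.acoustic_rnn = nn.GRU(feat_dim, embed_dim, batch_first=True)
        # Subsystem 2: query encoder over the characters of the text query.
        self.char_embed = nn.Embedding(num_chars, 64)
        self.query_rnn = nn.GRU(64, embed_dim, batch_first=True)
        # Subsystem 3: feedforward network that scores whether the
        # query occurs in the utterance.
        self.classifier = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, 1),
        )

    def forward(self, speech_feats, query_chars):
        # speech_feats: (batch, frames, feat_dim); query_chars: (batch, chars)
        _, h_acoustic = self.acoustic_rnn(speech_feats)
        _, h_query = self.query_rnn(self.char_embed(query_chars))
        # Concatenate the two embeddings and predict an occurrence logit.
        joint = torch.cat([h_acoustic[-1], h_query[-1]], dim=-1)
        return self.classifier(joint).squeeze(-1)

# Toy usage: one 200-frame utterance of 40-dim features, one 5-character query.
model = ASRFreeKWS()
logit = model(torch.randn(1, 200, 40), torch.randint(0, 30, (1, 5)))
prob = torch.sigmoid(logit)  # probability that the query occurs in the utterance
```

Training such a model would only require (utterance, query, occurs-or-not) triples rather than full transcriptions, which is what lets the system avoid building an ASR model as a precursor.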