Publication
SIGIR 2007
Conference paper

Vocabulary independent spoken term detection

View publication

Abstract

We are interested in retrieving information from speech data like broadcast news, telephone conversations and roundtable meetings. Today, most systems use large vocabulary continuous speech recognition tools to produce word transcripts; the transcripts are indexed and query terms are retrieved from the index. However, query terms that are not part of the recognizer's vocabulary cannot be retrieved, and the recall of the search is affected. In addition to the output word transcript, advanced systems provide also phonetic transcripts, against which query terms can be matched phonetically. Such phonetic transcripts suffer from lower accuracy and cannot be an alternative to word transcripts.We present a vocabulary independent system that can handle arbitrary queries, exploiting the information provided by having both word transcripts and phonetic transcripts. A speech recognizer generates word confusion networks and phonetic lattices. The transcripts are indexed for query processing and ranking purpose.The value of the proposed method is demonstrated by the relative high performance ofour system, which received the highest overall ranking for US English speech data in the recent NIST Spoken Term Detection evaluation. Copyright 2007 ACM.

Date

Publication

SIGIR 2007

Authors

Topics

Share