Topic models over spoken language

Niketan Pansare; Chris Jermaine; Peter Haas; Nitendra Rajput

doi:10.1109/ICDM.2012.90

ICDM 2012

Conference paper

01 Dec 2012

Abstract

Virtually all work on topic modeling has assumed that the topics are to be learned over a text-based document corpus. However, there exist important applications where topic models must be learned over an audio corpus of spoken language. Unfortunately, speech-to-text programs can have very low accuracy. We therefore propose a novel topic model for spoken language that incorporates a statistical model of speech-to-text software behavior. Crucially, our model exploits the uncertainty numbers returned by the software. Our ideas apply to any domain in which it would be useful to build a topic model over data in which uncertainties are explicitly represented. © 2012 IEEE.
