Hierarchical variational loopy belief propagation for multi-talker speech recognition
Abstract
We present a new method for single-channel multi-talker speech recognition that combines loopy belief propagation and variational inference to control the complexity of inference. The method models each source with an HMM over a hierarchical set of acoustic states, and uses the max model to approximate how the sources interact to generate the mixed data. Inference involves estimating a set of probabilistic time-frequency masks to separate the speakers. By conditioning these masks on the hierarchical acoustic states of the speakers, the fidelity and complexity of acoustic inference can be precisely controlled. Acoustic inference with the algorithm scales linearly with the number of probabilistic time-frequency masks, and temporal inference scales linearly with the language model size. Results on the monaural Speech Separation Challenge (SSC) data demonstrate that the presented Hierarchical Variational Max-Sum Product (HVMSP) algorithm outperforms VMSP by over 2% absolute while using 4 times fewer probabilistic masks. HVMSP furthermore performs on par with the MSP algorithm, which uses exact conditional marginal likelihoods, while using 256 times fewer time-frequency masks. © 2009 IEEE.
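As a rough illustration of the max interaction model mentioned above, the sketch below (not from the paper; the variable names mu_a, var_a, mu_b, var_b and the function max_model_mask are hypothetical) computes a probabilistic time-frequency mask for two speakers, assuming each speaker's log-spectrum in a given acoustic state is modeled by a diagonal Gaussian. Under the max model, a speaker explains a bin if it emits the observed value while the other speaker stays below it.

import numpy as np
from scipy.stats import norm

def max_model_mask(y, mu_a, var_a, mu_b, var_b):
    # Posterior probability that speaker A dominates each frequency bin.
    # Under the max model the observed mixed log-spectrum y is approximated
    # by the element-wise max of the two source log-spectra, so the
    # likelihood that A explains bin f is
    #     N(y; mu_a, var_a) * Phi((y - mu_b) / sd_b),
    # i.e. A emits y exactly while B stays below it (symmetrically for B).
    sd_a, sd_b = np.sqrt(var_a), np.sqrt(var_b)
    lik_a = norm.pdf(y, mu_a, sd_a) * norm.cdf(y, mu_b, sd_b)  # A on top
    lik_b = norm.pdf(y, mu_b, sd_b) * norm.cdf(y, mu_a, sd_a)  # B on top
    total = lik_a + lik_b + 1e-300                             # guard against 0/0
    return lik_a / total                                       # soft mask for speaker A

# Toy usage: 5 frequency bins, speaker A expected to dominate the first two.
y    = np.array([ 3.0,  2.5,  0.0, -1.0, -2.0])
mask = max_model_mask(y,
                      mu_a=np.array([ 3.0,  2.0, -2.0, -3.0, -3.0]), var_a=np.ones(5),
                      mu_b=np.array([-2.0, -1.0,  0.5, -1.0, -1.5]), var_b=np.ones(5))
print(mask)  # values near 1 where A dominates, near 0 where B does

In the full method such masks are conditioned on the hierarchical acoustic states of the speakers, which is what lets the number of masks (and hence the cost of acoustic inference) be traded off against fidelity.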