Shilei Zhang, Yong Qin
ICASSP 2012
In this paper we present our current work on automatic speaker recognition using keyword-conditioned phone N-gram modeling. We propose the use of contextual information around keywords in modeling a speaker's pronunciation characteristics at a phonetic level. Our approach is to add time margins around keywords when aligning keyword regions with keyword-specific phone events for feature vector generation. Including such additional information by incorporating time margins can capture idiosyncratic pronunciation information and is shown to help our keyword-conditioned phonetic speaker verification system achieve more than 50% (relative) performance improvement. This leads our high-level speaker verification system (i.e., fusion of non-conditioned and keyword-conditioned phonetic speaker verification systems) to currently achieve the best published result for the English 8-conversation enrollment telephony task of the 2008 NIST Speaker Recognition Evaluation for systems utilizing features not based directly on low-level acoustic information. © 2012 IEEE.
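The time-margin idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the phone events are taken to be (label, start, end) tuples from a phone recognizer, a phone is assigned to a keyword region when its midpoint falls inside the (optionally widened) region, and the feature is a plain bigram count. The names `phones_in_region`, `ngram_counts`, and `keyword_ngram_features`, the margin of 0.10 s, and the toy data are all hypothetical.

```python
from collections import Counter


def phones_in_region(phones, start, end, margin=0.0):
    """Return labels of phone events whose midpoint lies in
    [start - margin, end + margin]. The midpoint rule is an assumption."""
    lo, hi = start - margin, end + margin
    return [label for (label, p_start, p_end) in phones
            if lo <= (p_start + p_end) / 2.0 <= hi]


def ngram_counts(labels, n=2):
    """Count phone N-grams (as tuples) in a sequence of phone labels."""
    return Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))


def keyword_ngram_features(phones, keyword_hits, margin=0.0, n=2):
    """Accumulate phone N-gram counts per keyword over all of its occurrences.

    keyword_hits is a list of (word, start, end) tuples, e.g. from a word
    recognizer; margin widens each keyword region on both sides.
    """
    features = {}
    for word, start, end in keyword_hits:
        labels = phones_in_region(phones, start, end, margin)
        features.setdefault(word, Counter()).update(ngram_counts(labels, n))
    return features


# Toy data (hypothetical): phone events and one keyword occurrence, in seconds.
phones = [
    ("sil", 0.00, 0.10), ("ay", 0.10, 0.20), ("y", 0.20, 0.28),
    ("uw", 0.28, 0.40), ("n", 0.40, 0.48), ("ow", 0.48, 0.60),
    ("sil", 0.60, 0.70),
]
keyword_hits = [("you", 0.20, 0.40)]

no_margin = keyword_ngram_features(phones, keyword_hits, margin=0.0)
with_margin = keyword_ngram_features(phones, keyword_hits, margin=0.10)
```

In this toy case the region without a margin covers only the phones inside the keyword, while the widened region also picks up the neighboring phones on each side, so the resulting bigram counts carry some of the surrounding context.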
Weizhong Zhu, Jason Pelecanos
ICASSP 2016
John Z. Sun, Kush R. Varshney, et al.
ICASSP 2012
Raul Fernandez, Steve Minnis, et al.
ICASSP 2012