Word level confidence measurement using semantic features
Ruhi Sarikaya, Yuqing Gao, et al.
ICASSP 2003
Deep learning methodologies have had a major impact on performance across a wide variety of machine learning tasks, and speech recognition is no exception. We describe a set of deep learning techniques that proved particularly successful in reducing word error rate on a popular large-vocabulary conversational speech recognition benchmark ("Switchboard"). We found that the best performance is achieved by combining features from both recurrent and convolutional neural networks. We compare two recurrent architectures: partially unfolded nets with maxout activations and bidirectional long short-term memory nets. In addition, inspired by the success of convolutional networks for image classification, we designed a convolutional net with many convolutional layers and small kernels, which builds up a large receptive field with more nonlinearity and fewer parameters than standard configurations. When combined, these neural networks achieve a word error rate of 6.2% on this difficult task; this was the best reported rate at the time of writing and is even more remarkable given that human performance itself is estimated to be 4% on this data.
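The abstract's claim about small kernels follows the well-known stacking argument: two 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, with fewer weights and an extra nonlinearity between them. A minimal arithmetic sketch (the channel width `C` is an illustrative assumption, not a figure from the paper):

```python
def conv_params(kernel, in_ch, out_ch):
    """Weight count of a square 2-D convolution (bias terms ignored)."""
    return kernel * kernel * in_ch * out_ch

C = 64  # hypothetical channel width, chosen only for illustration

# Two stacked 3x3 layers: same 5x5 receptive field, one extra nonlinearity.
stacked = 2 * conv_params(3, C, C)  # 2 * 9 * C^2 = 18 * C^2
# One 5x5 layer covering the same receptive field in a single step.
single = conv_params(5, C, C)       # 25 * C^2

print(stacked, single)  # 73728 102400 -> stacked uses ~28% fewer parameters
```

The same arithmetic extends to deeper stacks: n layers of 3x3 kernels cover a (2n+1)x(2n+1) receptive field at cost 9n·C², versus (2n+1)²·C² for one large kernel.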